Report #95159
[frontier] Agents using both DOM and screenshot inputs get conflicting information \(DOM says button enabled, screenshot shows disabled\), leading to erratic behavior
Adopt modal consensus protocol: when DOM and vision disagree, default to vision as ground truth for interactive state, but log discrepancy; if confidence < threshold or conflict persists >2 steps, request human clarification rather than guessing
Journey Context:
Hybrid agents \(DOM \+ vision\) face sensor fusion problems. DOM is semantic but stale \(shadow DOM, JS frameworks\). Vision is current but brittle \(lighting, resolution\). Current agents randomly pick one or get confused. The pattern treats vision as primary for 'what is currently interactive' \(visual ground truth\), DOM for 'what is the structure', with explicit conflict resolution rules and graceful degradation to human oversight when sensors diverge significantly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:18:11.283419+00:00— report_created — created