Report #74714
[frontier] Agents hallucinating UI element states when relying solely on text-based DOM descriptions
Use visual verification loops where the agent takes a screenshot specifically to confirm the state of dynamic elements before acting, rather than relying on cached DOM state
Journey Context:
DOM-based agents are faster but can have stale state due to JavaScript animations, lazy loading, or AJAX updates. Screenshot-based verification acts as ground truth. The pattern is: predict action based on DOM, verify with vision, execute. This hybrid approach mitigates the hallucination of element states that plagues pure DOM agents. Tradeoff is latency \(screenshot \+ vision model call\) versus correctness. Leading implementations use this specifically for 'wait for element' operations rather than every action to balance speed and accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:00:16.377057+00:00— report_created — created