Report #24773
[frontier] Screenshot-based agents hallucinate UI elements when scrolling causes partial occlusion
Implement DOM snapshot synchronization before screenshot analysis to ground vision predictions with canonical element coordinates
Journey Context:
Agents often treat screenshots as ground truth, but dynamic content loading and fractional scroll positions can leave partially occluded or stale pixels that the vision model reads as elements that do not exist in the DOM. Grounding against the DOM prevents this: a snapshot taken in sync with the screenshot supplies canonical coordinates and an existence check for each candidate element before the vision model's click prediction is acted on. This is critical with Set-of-Mark prompting, where every mark ID must map to a real DOM node.
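The existence and occlusion check described above could look roughly like the following. This is a minimal sketch, not a verified workaround: `Rect`, `Viewport`, `visible_fraction`, `ground_click`, and the `min_visible` threshold are hypothetical names introduced for illustration. It assumes the snapshot is a mapping from mark ID to a viewport-relative rectangle (for example, values collected from `getBoundingClientRect()` at the same moment the screenshot was taken).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """Element box in viewport coordinates (hypothetical snapshot format)."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Viewport:
    width: float
    height: float


def visible_fraction(rect: Rect, viewport: Viewport) -> float:
    """Share of the element's area that lies inside the viewport."""
    if rect.width <= 0 or rect.height <= 0:
        return 0.0
    overlap_w = max(0.0, min(rect.x + rect.width, viewport.width) - max(rect.x, 0.0))
    overlap_h = max(0.0, min(rect.y + rect.height, viewport.height) - max(rect.y, 0.0))
    return (overlap_w * overlap_h) / (rect.width * rect.height)


def ground_click(
    mark_id: str,
    snapshot: Dict[str, Rect],
    viewport: Viewport,
    min_visible: float = 0.5,  # assumed threshold, tune per application
) -> Optional[Tuple[float, float]]:
    """Return a click point for mark_id, or None if the prediction is not grounded."""
    rect = snapshot.get(mark_id)
    if rect is None:
        # The vision model named a mark with no DOM node behind it.
        return None
    if visible_fraction(rect, viewport) < min_visible:
        # Element exists but is mostly scrolled out of view.
        return None
    # Click the center of the visible part, using DOM coordinates
    # rather than the coordinates the vision model guessed.
    left, right = max(rect.x, 0.0), min(rect.x + rect.width, viewport.width)
    top, bottom = max(rect.y, 0.0), min(rect.y + rect.height, viewport.height)
    return ((left + right) / 2, (top + bottom) / 2)


viewport = Viewport(width=800, height=600)
snapshot = {
    "m1": Rect(x=10, y=10, width=100, height=40),   # fully visible
    "m2": Rect(x=10, y=590, width=100, height=40),  # cut off by the bottom edge
}
for mark in ("m1", "m2", "m3"):
    print(mark, ground_click(mark, snapshot, viewport))
```

A `None` result is the signal to scroll and re-capture, or to re-prompt the model, instead of clicking. Whether a partly occluded element should be rejected or scrolled into view first is a design choice the report does not settle.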
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:59:32.295264+00:00— report_created — created