Report #24537
[frontier] Agent reads non-existent text from UI element because vision model hallucinates content
Never trust vision-only OCR for critical data extraction. Use structured accessibility APIs (MSAA, UI Automation, AX API) to read the actual text properties of elements. Use vision only for visual or spatial reasoning (is the button red?), not for semantic content (what does the label say?).
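A minimal sketch of reading text from structure rather than pixels. It uses Python's standard-library `html.parser` on a DOM snippet as a stand-in for a real accessibility API; the `TextById` class and `read_text_by_id` helper are hypothetical names introduced for illustration, and the MSAA, UI Automation, and AX API calls themselves are not shown. The point is that a missing element yields `None` and an empty element yields `""`, so the agent never has anything to "read" that is not there.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while the parser is inside the target element
        self.chunks = []      # raw text fragments found inside the target
        self.done = False     # True once the target element has been closed

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def read_text_by_id(html, element_id):
    """Return the element's text, "" if it is empty, or None if it is absent."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    if not parser.done and parser.depth == 0:
        return None  # element never appeared: report absence, do not guess
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = '<div><span id="total">$1,010.00</span><span id="note"></span></div>'
    print(read_text_by_id(page, "total"))
    print(read_text_by_id(page, "missing"))
```

In a desktop or browser agent, the same contract (text, empty string, or explicit absence) would be served by whichever accessibility or DOM API the platform provides.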
Journey Context:
GPT-4V, Claude, and other vision models are prone to OCR hallucinations on small fonts, low-contrast UI, or busy backgrounds. They may 'read' text that isn't there or confuse similar characters (l/1, O/0). This is catastrophic for agents automating workflows (e.g., reading an invoice total). The DOM or accessibility tree provides the ground-truth text content that the browser or OS actually renders. The tradeoff is that accessibility APIs miss visual styling information.
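The failure modes above suggest a simple cross-check whenever both a structured reading and a vision reading are available. The sketch below is illustrative only: `CONFUSABLES`, `canonical`, and `compare_readings` are hypothetical names, and the confusable-character table is a small assumed sample, not an exhaustive list. The structured text is always treated as authoritative; the vision reading is only classified.

```python
# Characters that are easy to misread, mapped to one canonical form.
# Assumed, non-exhaustive table for illustration.
CONFUSABLES = str.maketrans({"l": "1", "I": "1", "|": "1", "O": "0", "o": "0"})


def _squash(text):
    """Collapse runs of whitespace so layout differences do not count."""
    return " ".join(text.split())


def canonical(text):
    """Whitespace-normalized text with confusable characters folded together."""
    return _squash(text).translate(CONFUSABLES)


def compare_readings(structured_text, vision_text):
    """Classify a vision reading against the structured (ground-truth) text.

    Returns one of:
      "no_ground_truth"      structured source had no such element
      "match"                readings agree
      "confusable_mismatch"  readings differ only by look-alike characters
      "mismatch"             readings differ in some other way
    """
    if structured_text is None:
        return "no_ground_truth"
    truth, seen = _squash(structured_text), _squash(vision_text)
    if truth == seen:
        return "match"
    if canonical(truth) == canonical(seen):
        return "confusable_mismatch"
    return "mismatch"


if __name__ == "__main__":
    print(compare_readings("$1,010.00", "$l,O1O.00"))
    print(compare_readings(None, "Submit"))
```

A `"no_ground_truth"` result paired with a confident vision reading is the case this report describes: the agent should treat the text as unverified rather than act on it.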
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:35:36.699561+00:00— report_created — created