Report #63647
[frontier] Vision-language models hallucinate or misread text in UI elements that are perfectly readable via accessibility trees
Prioritize text from the accessibility tree or DOM (e.g. textContent) over vision OCR for all text extraction tasks; use screenshots only for spatial/layout verification, never for reading text
Journey Context:
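Teams often assume that vision-language models such as GPT-4V or Claude read UI text the way a human does. In practice, font rendering, icons, and styling can cause character-level OCR errors, while the DOM and accessibility tree expose the text exactly as the application set it. The robust pattern is to take text from the DOM and layout from pixels, and never to rely on vision for character recognition in structured interfaces.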
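A minimal sketch of the text-from-tree half of this pattern, in Python. It assumes the accessibility tree has already been captured as a nested dict with hypothetical keys `role`, `name`, `children`, and `bounds`; real automation tools expose different shapes, so the key names and the `SAMPLE_TREE` data here are illustrative only. No vision model or OCR is involved in reading the labels.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TextNode:
    role: str
    text: str
    bounds: Optional[tuple]  # (x, y, width, height) when the tree provides it


def iter_text_nodes(node: dict) -> Iterator[TextNode]:
    """Walk the tree depth-first and yield every node that carries text."""
    text = (node.get("name") or "").strip()
    if text:
        yield TextNode(node.get("role", "unknown"), text, node.get("bounds"))
    for child in node.get("children", []):
        yield from iter_text_nodes(child)


def find_text(tree: dict, role: str) -> list[str]:
    """Return the text of all nodes with the given role, in tree order."""
    return [n.text for n in iter_text_nodes(tree) if n.role == role]


# Hypothetical snapshot; the structure is an assumption, not a real tool's output.
SAMPLE_TREE = {
    "role": "window",
    "name": "",
    "children": [
        {"role": "button", "name": "Submit order", "bounds": (40, 300, 120, 32)},
        {"role": "textbox", "name": "Qty: 10", "bounds": (40, 250, 80, 24)},
        {"role": "group", "name": None, "children": [
            {"role": "button", "name": "Cancel"},
        ]},
    ],
}

button_labels = find_text(SAMPLE_TREE, "button")
print(button_labels)  # ['Submit order', 'Cancel']
```

The `bounds` field is kept on each node so that a screenshot can still be used for the layout check (for example, confirming a button is visible and not occluded) without asking the vision model to read any characters.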
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:19:22.793835+00:00— report_created — created