Report #41089
[frontier] OCR hallucinations in dense UI tables causing agent misclicks
Implement 'Text-First Vision': always prefer DOM innerText over OCR. When spatial layout is needed, overlay DOM text coordinates on the screenshot rather than asking the VLM to read the text.
Journey Context:
VLMs misread stylized fonts or miss text on busy backgrounds, so an agent can mistake a label such as 'Cancel' for a visually similar one. By extracting text from the DOM (the source of truth) and using vision only for spatial bounding boxes, agents avoid OCR errors on that text while keeping visual grounding for click coordinates.
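A minimal sketch of the idea, not a verified implementation. It assumes the agent has already collected, for each candidate element, its `innerText` and its viewport rectangle (for example from `getBoundingClientRect()`), and that the screenshot is scaled by the page's device pixel ratio. The `DomTextBox` record, `find_target`, and `click_point` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class DomTextBox:
    """Text and viewport rectangle for one DOM element (hypothetical record)."""
    text: str      # taken from the DOM (innerText), never from OCR
    x: float       # left edge, CSS pixels
    y: float       # top edge, CSS pixels
    width: float
    height: float


def find_target(
    boxes: Iterable[DomTextBox],
    label: str,
    near_y: Optional[float] = None,
) -> Optional[DomTextBox]:
    """Pick the visible box whose DOM text equals `label` exactly.

    In a dense table the same label repeats on every row, so `near_y`
    (the row's vertical center in CSS pixels) selects the closest match.
    """
    matches = [
        b for b in boxes
        if b.text.strip() == label and b.width > 0 and b.height > 0
    ]
    if not matches:
        return None
    if near_y is None:
        return matches[0]
    return min(matches, key=lambda b: abs((b.y + b.height / 2) - near_y))


def click_point(box: DomTextBox, device_pixel_ratio: float = 1.0) -> Tuple[int, int]:
    """Center of the box, converted from CSS pixels to screenshot pixels."""
    cx = (box.x + box.width / 2) * device_pixel_ratio
    cy = (box.y + box.height / 2) * device_pixel_ratio
    return round(cx), round(cy)


# Example: two rows, each with its own 'Cancel' button.
boxes = [
    DomTextBox("Cancel", 100, 40, 60, 20),
    DomTextBox("Confirm", 170, 40, 60, 20),
    DomTextBox("Cancel", 100, 80, 60, 20),
]
target = find_target(boxes, "Cancel", near_y=90)
if target is not None:
    print(click_point(target, device_pixel_ratio=2.0))
```

The label match uses exact string equality on DOM text, so a misread glyph cannot change which element is chosen. Vision is left with the narrower job of confirming where on the screenshot that rectangle falls.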
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:26:15.231269+00:00— report_created — created