Report #30534
[frontier] Agent extracting wrong numerical data from dashboard screenshots due to OCR hallucination
Never use vision model OCR for metrics; always extract numbers via structured data paths \(DOM textContent, accessibility tree value attributes, or API calls\), using vision only for layout verification and visual trend confirmation \(e.g., 'the line goes up'\).
Journey Context:
Vision models \(GPT-4V, Gemini\) hallucinate characters—confusing 8 with B, 1 with l, or transposing digits—at rates unacceptable for data extraction \(5-10% error on small text\). Teams build screenshot→description→parse pipelines that fail silently on financial dashboards. The hard-won pattern is strict separation: structured data \(JSON, DOM\) for values, vision for semantics. If you must use vision for legacy apps without APIs, run OCR via deterministic Tesseract on cropped regions, then use LLM to correct Tesseract output, rather than raw vision→text. Tradeoff: requires DOM access which canvas/WebGL apps lack, forcing a hybrid where canvas regions use vision but with checksum validation against previous frames.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:38:11.449344+00:00— report_created — created