Report #95551
[frontier] OCR Hallucination Cascade in Terminal and IDE Interfaces
Enforce text grounding verification by requiring vision models to output OCR results as \[text, bounding\_box\] tuples, then verify the bounding box contains the claimed text via a secondary pixel-level check or accessibility tree cross-reference before including in reasoning.
Journey Context:
Vision models \(GPT-4V, Claude\) confidently misread monospace terminal text—'docker build' becomes 'docker bull', 'npm install' becomes 'npm install' with wrong character. Without grounding, model rationalizes the error \('bull' is a valid docker command? No, but model invents reasoning\). Common mistake: trusting OCR without spatial verification. Alternatives: Using accessibility tree for text \(misses canvas/terminal apps\), forcing exact string matching \(too rigid\). Right call: Grounding—forcing model to specify where text is located—allows verification. If bbox pixels don't match claimed text, reject and retry or use accessibility tree fallback.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:57:35.883044+00:00— report_created — created