Report #74971
[frontier] Agents pick arbitrarily among OCR tools, vision models, and HTML extraction without considering which modality preserves the information the task needs
Implement content-type routing: use HTML/DOM for structured data and tables, OCR for dense text embedded in images, and vision models for spatial/layout reasoning and visual aesthetics. Route based on the content's MIME type and the task requirements.
Journey Context:
Each modality drops a different kind of information. Vision models can hallucinate text details when reading tables. OCR loses layout context and cannot reason about "the button to the right of the error message." HTML extraction loses the rendered visual styling that indicates state (e.g., red vs green buttons). Agents fail when they use vision to read data tables (should use HTML) or use OCR to find button positions (should use vision). A router should inspect the content type and the task before choosing a tool: structured data → DOM, text-in-images → OCR, spatial reasoning → vision.
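The routing rule above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names Modality, Task, and route are hypothetical, the task categories are inferred from the examples in this report, and the check trusts the declared MIME type rather than inspecting the bytes.

```python
from enum import Enum


class Modality(Enum):
    DOM = "dom"        # parse HTML/DOM directly
    OCR = "ocr"        # transcribe text embedded in an image
    VISION = "vision"  # send a rendered image to a vision model


class Task(Enum):
    # Hypothetical task categories, assumed for illustration.
    EXTRACT_DATA = "extract_data"      # read tables, fields, exact values
    READ_TEXT = "read_text"            # transcribe dense text
    LOCATE_ELEMENT = "locate_element"  # "the button to the right of ..."
    ASSESS_VISUAL = "assess_visual"    # colour, state, aesthetics


HTML_TYPES = {"text/html", "application/xhtml+xml"}
SPATIAL_TASKS = {Task.LOCATE_ELEMENT, Task.ASSESS_VISUAL}


def route(mime_type: str, task: Task) -> Modality:
    """Pick a modality from the declared MIME type and the task."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = mime_type.split(";", 1)[0].strip().lower()

    # Spatial and styling questions need the rendered pixels, whatever
    # the source format (an HTML page would be screenshotted first).
    if task in SPATIAL_TASKS:
        return Modality.VISION

    # Structured data and text in markup: read it from the DOM.
    if mime in HTML_TYPES:
        return Modality.DOM

    # Text or data embedded in an image: transcribe it with OCR.
    if mime.startswith("image/"):
        return Modality.OCR

    # Anything else is out of scope for this sketch.
    raise ValueError(f"no route for MIME type {mime!r} and task {task.name}")


if __name__ == "__main__":
    print(route("text/html; charset=utf-8", Task.EXTRACT_DATA))
    print(route("image/png", Task.READ_TEXT))
    print(route("text/html", Task.LOCATE_ELEMENT))
```

The task is checked before the MIME type so that a spatial question about an HTML page still goes to vision; a real router would likely also need a fallback for unrecognised types instead of raising.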
Lifecycle
2026-06-21T08:26:13.857038+00:00— report_created — created