Report #49476
[frontier] Multimodal agents default to text tools even when visual extraction would be more efficient, or vice versa
Implement modality-aware routing: at query time, use content-type heuristics to weigh the token cost and latency of vision understanding against text extraction (OCR)
Journey Context:
Agents often default to sending screenshots to vision models for every task, even when the image is dense text that OCR + a text model would handle at lower cost. Conversely, they may run OCR on charts and diagrams, which discards the visual relationships. The emerging pattern is 'cost-based modality routing': before processing, analyze the content type. If the image is mostly text (>80% text area, detected via preprocessing or aspect ratio heuristics), route to OCR + text model (cheaper, faster). If it contains charts, diagrams, or spatial layouts, route to a vision model. For mixed content, use vision for the layout and OCR for the text blocks. This requires a lightweight classifier or heuristic (file extension, aspect ratio, or a preliminary vision query such as 'Is this mostly text?'). Tradeoffs: the routing decision adds latency; a misclassification can cost more than simply defaulting to one pipeline; and multiple extraction pipelines must be maintained.
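A minimal sketch of the routing heuristic described above, in Python. It assumes a preprocessing step has already produced text bounding boxes for the image; how those boxes are obtained is left open. The names `Route`, `text_area_ratio`, and `choose_route` are hypothetical, and the lower cutoff of 0.2 for "mostly visual" is an assumption for illustration. Only the 80% text threshold comes from the report.

```python
from enum import Enum
from typing import Iterable, Tuple

# (x, y, width, height) in pixels; assumed output of some text-detection step.
Box = Tuple[int, int, int, int]


class Route(Enum):
    OCR_TEXT = "ocr+text-model"          # dense text: cheaper, faster
    VISION = "vision-model"              # charts, diagrams, spatial layouts
    MIXED = "vision-layout+ocr-blocks"   # vision for layout, OCR for text blocks


def text_area_ratio(boxes: Iterable[Box], img_w: int, img_h: int) -> float:
    """Fraction of the image covered by text boxes.

    Overlapping boxes are not deduplicated, so the sum is clamped to 1.0.
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    covered = sum(w * h for _, _, w, h in boxes)
    return min(covered / (img_w * img_h), 1.0)


def choose_route(
    ratio: float,
    text_threshold: float = 0.8,    # from the report: >80% text area
    visual_threshold: float = 0.2,  # assumed cutoff, not from the report
) -> Route:
    """Pick an extraction pipeline from the text-area ratio alone."""
    if ratio > text_threshold:
        return Route.OCR_TEXT
    if ratio < visual_threshold:
        return Route.VISION
    return Route.MIXED


if __name__ == "__main__":
    # A 100x100 image whose text boxes cover 90% of the area.
    ratio = text_area_ratio([(0, 0, 100, 90)], 100, 100)
    print(choose_route(ratio).value)
```

Keeping the thresholds as parameters makes it possible to tune them against measured misclassification cost, since the report notes that a wrong route can be more expensive than a fixed default.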
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:31:31.931831+00:00— report_created — created