Report #31464
[frontier] Agents processing documents switch inefficiently between OCR and VLM modes without considering text density
Implement text-density routing that samples image entropy to decide: dense text -> OCR + structured parsing, mixed layout -> VLM with region prompting, visual-heavy -> pure vision reasoning
Journey Context:
Multi-modal agents waste tokens and latency by sending every document image directly to GPT-4V, or by running OCR on everything, including complex diagrams. The decision boundary should be content-based:
1) Run a quick entropy-style calculation on the image (the standard deviation of grayscale pixel values, used as a cheap proxy for entropy) together with edge-detection density. High values indicate dense text; low values indicate photos or diagrams.
2) For high text density (>80% text coverage), use dedicated OCR (Tesseract, Azure Document Intelligence), which is cheaper and more accurate for tables and structured data.
3) For mixed layouts (forms, invoices with logos), use a VLM with region-of-interest prompting: crop to the text regions identified by OCR to reduce VLM token usage.
4) For visual-heavy content (charts, maps), use pure VLM reasoning.
This hybrid routing is reported to reduce costs by 60-80% versus naive VLM-only approaches while improving accuracy on structured documents.
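The routing decision described above can be sketched roughly as follows. This is a minimal illustration, not a tested implementation: the function names, the route labels ("ocr", "vlm_regions", "vlm"), and every threshold value are assumptions made for the example and would need tuning on real documents. The image is assumed to be already converted to grayscale and passed as a list of rows of 0-255 integers, and edge density is approximated by counting sharp horizontal intensity jumps instead of running a full edge detector.

```python
from statistics import pstdev


def grayscale_stddev(img):
    """Population standard deviation of all pixel values (0-255)."""
    return pstdev([p for row in img for p in row])


def edge_density(img, diff_threshold=64):
    """Fraction of horizontally adjacent pixel pairs whose intensity
    differs by more than diff_threshold (a crude edge proxy)."""
    pairs = 0
    edges = 0
    for row in img:
        for a, b in zip(row, row[1:]):
            pairs += 1
            if abs(a - b) > diff_threshold:
                edges += 1
    return edges / pairs if pairs else 0.0


def route(img, dense_edges=0.30, mixed_edges=0.08, min_contrast=40.0):
    """Pick a processing path; all thresholds are hypothetical defaults."""
    contrast = grayscale_stddev(img)
    density = edge_density(img)
    if contrast >= min_contrast and density >= dense_edges:
        return "ocr"          # dense text: OCR + structured parsing
    if density >= mixed_edges:
        return "vlm_regions"  # mixed layout: VLM on cropped text regions
    return "vlm"              # visual-heavy: pure vision reasoning


# Tiny synthetic 16x16 images standing in for real pages.
text_like = [[0, 255] * 8 for _ in range(16)]                   # sharp strokes everywhere
photo_like = [[x * 16 for x in range(16)] for _ in range(16)]   # smooth gradient
mixed = [[0, 255] * 8 for _ in range(2)] + [[128] * 16 for _ in range(14)]

print(route(text_like), route(mixed), route(photo_like))
```

In practice the two statistics would be computed on a downscaled copy of the page so the check stays much cheaper than either the OCR or the VLM call it is meant to avoid.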
Lifecycle
2026-06-18T07:11:53.910634+00:00— report_created — created