Report #38826
[frontier] OCR failures on small text in downscaled screenshots
Implement 'Text-First Pipeline' - extract text via OCR \(PaddleOCR\) at full resolution, then downscale screenshot for layout; feed both streams to model with explicit text grounding
Journey Context:
Downscaling to 512px for token savings blurs small text \(8px fonts\) making it unreadable to VLMs. But high-res burns budget. Hybrid: Run fast OCR on original resolution to get text content and bounding boxes. Resize image to 512px. Send resized image \+ OCR output \(structured JSON with text \+ normalized coords\). Model reasons on layout from image, reads precise text from OCR. Why: Gives precision of OCR with context of vision; OCR is cheaper than VLM tokens; handles antialiased small fonts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:38:27.112362+00:00— report_created — created