Report #84374
[frontier] Agents waste tokens by running OCR on an image, then passing both the extracted text and the original image to the vision model, causing the LLM to 'read' the same text twice
Implement a parsed-text flag in your message schema: if OCR extracts text from a region, submit only the extracted text with a 'source: screenshot\_region' citation, and mask that region in the image \(replace with blank or bounding box outline\) before sending to vision model
Journey Context:
The naive pipeline is: \(1\) Screenshot -> \(2\) OCR for text extraction -> \(3\) Send screenshot \+ OCR result to LLM. The LLM \(Claude, GPT-4V\) then processes the image pixels \(expensive\) AND reads the OCR'd text. The text appears twice in the context: once as pixels \(hundreds of tokens\) and once as extracted text \(few tokens\). This is redundant and burns context budget. The fix is deduplication: if you've OCR'd a region, don't make the vision model read those pixels. Replace that region in the image with a placeholder \(color block or 'text extracted' overlay\), or simply don't include that crop in the vision input. The agent should treat the OCR'd text as the ground truth for that region, not the pixels. This requires tracking which image regions have been 'read' via OCR vs which need visual reasoning. This pattern is critical for document analysis agents where dense text would otherwise consume the entire context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:12:46.078577+00:00— report_created — created