Report #29391
[frontier] Agent hits context limit and truncates system prompt after capturing 4 full-page screenshots
Pre-calculate image tokens (for GPT-4o: low detail = 85 tokens flat; high detail = 85 base tokens plus ~170 tokens per 512px tile); enforce a 'visual budget' that evicts old screenshots after converting them to text summaries via a cheap OCR model.
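A minimal sketch of the pre-calculation step, assuming GPT-4o's published image accounting (fit within 2048x2048, shrink the shortest side to 768px, then count 512px tiles at 170 tokens each plus an 85-token base). The function name `estimate_image_tokens` is hypothetical, and the sketch assumes images are only ever downscaled, never upscaled; verify both against the current provider documentation before relying on the numbers.

```python
import math

LOW_DETAIL_TOKENS = 85   # flat cost for a low-detail image
BASE_TOKENS = 85         # fixed overhead for a high-detail image
TOKENS_PER_TILE = 170    # cost of each 512px tile in high detail


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image (hypothetical helper)."""
    if detail == "low":
        return LOW_DETAIL_TOKENS

    # Step 1: scale down to fit inside a 2048x2048 square (assumption: no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512px tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles
```

Calling this before a screenshot is appended to the conversation lets the agent decide whether the image still fits in the budget, rather than discovering the overflow after the request is sent.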
Journey Context:
Image tokens are deceptively dense: a 1024x1024 image consumes ~765 tokens in GPT-4o (4 tiles at 170 tokens each, plus 85 base), equivalent to ~500 words of text. Agents often capture 'before' and 'after' screenshots per step, which can burn through a 128k context window in fewer than 10 steps. The common mistake is treating images as 'cheap' context. The fix implements a sliding window: recent steps keep full screenshots; older steps are compressed via OCR (Tesseract) or a cheap vision model (GPT-4o-mini) into structured text such as 'Step 3: Saw confirmation code XYZ', then evicted from the image buffer.
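The sliding window described above might look roughly like the following sketch. The names `Step`, `VisualBudget`, and the `summarize` callback are all hypothetical; `summarize` stands in for whatever OCR or cheap vision call produces the text summary, and the per-image token count is assumed to be computed by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    index: int
    screenshot: Optional[bytes]    # None once the image has been evicted
    image_tokens: int              # estimated token cost while the image is kept
    summary: Optional[str] = None  # text replacement written at eviction time


class VisualBudget:
    """Keeps recent screenshots and compresses older ones to text (sketch)."""

    def __init__(self, max_image_tokens: int, summarize: Callable[[bytes], str]):
        self.max_image_tokens = max_image_tokens
        self.summarize = summarize  # assumption: wraps OCR or a cheap vision model
        self.steps: List[Step] = []

    def image_tokens_in_use(self) -> int:
        return sum(s.image_tokens for s in self.steps if s.screenshot is not None)

    def add(self, index: int, screenshot: bytes, image_tokens: int) -> None:
        self.steps.append(Step(index, screenshot, image_tokens))
        self._evict()

    def _evict(self) -> None:
        # Walk oldest-first; the newest screenshot is never evicted.
        for step in self.steps[:-1]:
            if self.image_tokens_in_use() <= self.max_image_tokens:
                break
            if step.screenshot is not None:
                step.summary = f"Step {step.index}: {self.summarize(step.screenshot)}"
                step.screenshot = None
                step.image_tokens = 0
```

With a 2,000-token image budget and screenshots of 765 tokens each, adding a fourth screenshot leaves only the two most recent images in the buffer; the first two survive as one-line summaries. Summarizing before dropping the image is the important ordering: once the screenshot is gone, details such as a confirmation code cannot be recovered.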
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:43:31.811623+00:00— report_created — created