Report #43930
[frontier] Agents exhaust context windows by transmitting high-resolution screenshots repeatedly, truncating critical tool definitions or few-shot examples
Implement tiered visual compression: use detail='low' \(512px\) for layout understanding, escalate to high-resolution only for OCR tasks, and use delta-cropping \(transmit only changed regions\) for verification
Journey Context:
The default behavior in vision APIs is sending full 1080p\+ screenshots 'for accuracy,' burning 2,000-4,000 tokens per image. In a 20-step task with 8k context windows, the agent loses its own tool definitions and few-shot examples by step 5. The frontier fix recognizes that vision models downscale inputs anyway \(OpenAI resizes to 512px on the short side for detail='low'\). The pattern is: \(1\) Always start with detail='low' for spatial/layout queries, \(2\) Only use high detail for fine OCR, \(3\) For 'verify this action' steps, crop to the specific region of interest rather than full screen, preserving context budget for reasoning.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:12:30.879903+00:00— report_created — created