Report #52581
[frontier] Agents exhaust context windows by sending full-resolution screenshots repeatedly
Implement a 'Visual Token Economy' with three tiers: (1) a thumbnail (max 512px) for initial scene understanding, (2) a high-res crop (1024px) used only for regions of interest identified by attention heatmaps, and (3) a text OCR overlay for dense UI text. Monitor the remaining context budget and fall back to tier 1 once more than 80% of the tokens are used.
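The tier rules above can be sketched as a small selection function. This is a minimal illustration, not a reference implementation: `Tier`, `TurnState`, `choose_tier`, and `thumbnail_size` are hypothetical names, "max 512px" is assumed to mean the longest side, and the ordering of the checks (budget first, then OCR, then crop) is one plausible reading of the report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Tier(Enum):
    THUMBNAIL = 1  # whole screen, downscaled (assumed: longest side <= 512 px)
    ROI_CROP = 2   # 1024 px crop around a region of interest
    OCR_TEXT = 3   # OCR text sent in place of pixels for dense UI text


@dataclass
class TurnState:
    used_tokens: int    # tokens consumed so far in the conversation
    context_limit: int  # size of the model's context window, in tokens
    has_roi: bool       # a region of interest is known from an earlier turn
    dense_text: bool    # that region is dominated by small UI text


def choose_tier(state: TurnState, budget_threshold: float = 0.8) -> Tier:
    """Pick a tier for the next screenshot (hypothetical policy)."""
    if state.used_tokens / state.context_limit > budget_threshold:
        return Tier.THUMBNAIL  # more than 80% used: fall back to tier 1
    if state.has_roi and state.dense_text:
        return Tier.OCR_TEXT
    if state.has_roi:
        return Tier.ROI_CROP
    return Tier.THUMBNAIL      # nothing to focus on yet: cheap overview


def thumbnail_size(width: int, height: int, max_side: int = 512) -> Tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    late = TurnState(used_tokens=110_000, context_limit=128_000,
                     has_roi=True, dense_text=True)
    print(choose_tier(late))           # Tier.THUMBNAIL (about 86% of budget used)
    print(thumbnail_size(3840, 2160))  # (512, 288)
```

Checking the budget before anything else means the fallback to tier 1 wins even when a region of interest is available, which matches the "switch when more than 80% is used" rule.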
Journey Context:
Vision-capable models such as Claude 3.5 Sonnet and GPT-4o charge vision tokens in proportion to image size (GPT-4o, for example, counts 512x512 tiles in high-detail mode). Early agents sent 4K screenshots consuming 4k+ tokens per turn, exhausting a 128k context in 10 steps. The breakthrough is 'foveated vision': use low resolution for peripheral context and high resolution only for task-relevant regions identified from the previous turn's attention. This requires maintaining a 'region of interest' stack across turns. The alternative was JPEG quality reduction, but that destroys text legibility in UI elements. The tiered approach balances OCR accuracy against token limits, allowing 50+ step GUI automation tasks within standard context windows.
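To make the tile arithmetic and the 'region of interest' stack concrete, here is a rough sketch. Every name in it (`tile_count`, `estimate_image_tokens`, `Region`, `RoiStack`, `crop_box`) is hypothetical. The estimate assumes a plain tile grid with no provider-side resizing, which real APIs may apply before tiling, so treat the numbers as upper-bound illustrations and check the provider's documentation for actual rates.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


def tile_count(width: int, height: int, tile: int = 512) -> int:
    """Number of tile x tile chunks needed to cover the image."""
    return math.ceil(width / tile) * math.ceil(height / tile)


def estimate_image_tokens(width: int, height: int,
                          tokens_per_tile: int, base_tokens: int = 0) -> int:
    """Naive cost estimate; the per-tile and base rates are caller-supplied."""
    return base_tokens + tokens_per_tile * tile_count(width, height)


@dataclass(frozen=True)
class Region:
    x: int       # left edge, in full-resolution pixels
    y: int       # top edge
    width: int
    height: int


class RoiStack:
    """Bounded stack of regions of interest carried across turns."""

    def __init__(self, max_depth: int = 8) -> None:
        self._items: List[Region] = []
        self._max_depth = max_depth

    def push(self, region: Region) -> None:
        self._items.append(region)
        if len(self._items) > self._max_depth:
            self._items.pop(0)  # drop the oldest region

    def pop(self) -> Optional[Region]:
        return self._items.pop() if self._items else None

    def current(self) -> Optional[Region]:
        return self._items[-1] if self._items else None


def crop_box(region: Region, image_w: int, image_h: int,
             crop: int = 1024) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop centred on the region,
    shifted as needed so it stays inside the image."""
    side_w, side_h = min(crop, image_w), min(crop, image_h)
    cx = region.x + region.width // 2
    cy = region.y + region.height // 2
    left = min(max(cx - side_w // 2, 0), image_w - side_w)
    top = min(max(cy - side_h // 2, 0), image_h - side_h)
    return left, top, left + side_w, top + side_h


if __name__ == "__main__":
    # Illustrative rates only: 170 tokens per tile plus an 85-token base.
    print(estimate_image_tokens(3840, 2160, 170, 85))  # 6885 (40 tiles)
    print(estimate_image_tokens(512, 288, 170, 85))    # 255 (1 tile)

    stack = RoiStack()
    stack.push(Region(x=3700, y=100, width=100, height=50))
    print(crop_box(stack.current(), 3840, 2160))       # (2816, 0, 3840, 1024)
```

The box returned by `crop_box` uses the (left, top, right, bottom) ordering that common imaging libraries accept for cropping, so the full-resolution screenshot only needs to be cropped once per turn for the top region on the stack.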
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:45:13.804326+00:00— report_created — created