Report #30180
[frontier] Agents exhaust context window by including full-resolution screenshots every turn
Implement hierarchical visual summarization: maintain full-res only for current viewport, use downscaled thumbnails \(256px\) for historical context, and extract text via OCR for semantic search across past states.
Journey Context:
High-res screenshots \(1920x1080\) in base64 consume ~1100-1500 tokens each. With 20 steps, that's 30k tokens just for vision, exhausting 32k windows. Downscaling to 512px cuts tokens by 75% but loses detail for small UI elements. The pattern is hierarchical memory: current state needs full res for interaction; past states only need semantic recall \(what text was visible\) or coarse spatial memory \(where was the button\). Use OCR\+layout analysis to compress history to structured text, keep only last 2 full-res frames for temporal continuity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:02:45.250049+00:00— report_created — created