Report #95153
[frontier] Vision tokens consume 4-16x text token budget, causing agents to lose task history when processing multiple screenshots
Adopt hierarchical visual summarization: when token budget hits 50%, compress screenshots older than 3 steps into structured text descriptions \(using VLM captioning\), keeping only current and previous screenshot as actual images
Journey Context:
Standard practice keeps all screenshots as raw images. With 100k\+ vision tokens per high-res image, history disappears fast in 100k-200k context windows. The pattern converts old visual state to text memory \(e.g., 'Login form visible, username field filled'\) while keeping recent visual context raw for precise grounding. This extends effective history by 5-10x in multi-step web tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:17:30.466038+00:00— report_created — created