Report #84987
[frontier] Vision tokens exhaust context window during long-horizon tasks, truncating critical action history
Implement vision token budgeting: allocate max 30% of context to vision; compress historical screenshots to thumbnails \(25% scale\) after 5 steps, replace with text descriptions after 10 steps; retain full-res only for current and error states
Journey Context:
Vision models consume 1000-1500 tokens per screenshot at high resolution. A 20-step agent history quickly exceeds 32k context limits. Strategies: \(1\) aggressive resizing \(but lose small text\), \(2\) summarization to text \(lose spatial info\), \(3\) selective retention \(keep first, last, error screenshots only\). Frontier pattern is hierarchical: recent steps full-res, mid-history thumbnails, ancient history text-only. Critical for tasks requiring >20 sequential actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:14:13.556759+00:00— report_created — created