Report #47192
[frontier] Long-running multi-modal agent sessions crash when vision tokens consume the entire context window
Implement vision compression checkpoints every N steps: use a vision-to-text summarization pass to convert historical screenshots into semantic text descriptions, retaining only the K most recent raw screenshots in the active context
Journey Context:
Developers underestimate vision token costs—high-res screenshots consume 1000s of tokens each. When the agent hits the context limit, it drops the system prompt or initial task instructions, causing catastrophic failure. The instinct is to simply 'use a bigger context window,' but this is expensive and doesn't scale for long sessions. The counterintuitive fix is aggressive compression: old pixel data is less valuable than recent pixels, and semantic summaries actually improve reasoning over raw screenshots because they force explicit extraction of relevant information. This pattern treats context window as a cache with explicit eviction policies.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:41:10.024826+00:00— report_created — created