Report #24532
[frontier] Multi-modal agent hits context limit after 10 screenshots in single session
Implement aggressive visual summarization: convert screenshots to structured text (a11y tree dump + OCR summary) and evict the raw image bytes from context after each step. Keep only the last 2 raw screenshots for reference.
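The eviction policy above could be sketched roughly as follows. This is a minimal illustration, not a tested integration with any agent framework: `Step`, `VisualContext`, and `add_step` are hypothetical names, and the summary string is assumed to be produced elsewhere (for example by an a11y tree dump plus OCR pass).

```python
from dataclasses import dataclass, field
from typing import List, Optional

KEEP_RAW = 2  # number of most recent raw screenshots to retain, per the report


@dataclass
class Step:
    summary: str                   # structured text: a11y tree dump + OCR summary
    image: Optional[bytes] = None  # raw screenshot bytes; None once evicted


@dataclass
class VisualContext:
    """Hypothetical container for per-step visual state in an agent session."""
    keep_raw: int = KEEP_RAW
    steps: List[Step] = field(default_factory=list)

    def add_step(self, image: bytes, summary: str) -> None:
        # Record the new step, then evict raw bytes from older steps.
        self.steps.append(Step(summary=summary, image=image))
        self._evict()

    def _evict(self) -> None:
        # Drop raw bytes from every step except the newest keep_raw ones.
        # Text summaries are always kept.
        cutoff = max(len(self.steps) - self.keep_raw, 0)
        for step in self.steps[:cutoff]:
            step.image = None

    def raw_image_count(self) -> int:
        return sum(1 for s in self.steps if s.image is not None)
```

Calling `add_step` once per agent action keeps every text summary in history while holding at most `keep_raw` raw images, so image cost stays constant as the session grows.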
Journey Context:
GPT-4o and Claude 3.5 Sonnet spend a large number of tokens on high-resolution images (e.g., a 1024x768 screenshot costs roughly 1,100 tokens as a base, rising to 1,600+ at higher detail). Agents doing computer-use tasks generate 50+ screenshots per session. Without eviction or summarization, stale screenshots fill the 128k/200k context window. Lowering the resolution is the alternative, but then the model misses small UI elements.
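A rough budget comparison using the figures quoted above. The per-image costs come from this report; the 300-token cost per text summary is an assumed placeholder, not a measured value, and `session_image_tokens` is a hypothetical helper.

```python
from typing import Optional


def session_image_tokens(
    n_screenshots: int,
    tokens_per_image: int,
    keep_raw: Optional[int] = None,
    tokens_per_summary: int = 0,
) -> int:
    """Estimate tokens spent on visual state over a session.

    keep_raw=None means no eviction: every raw screenshot stays in context.
    Otherwise only the newest keep_raw raw images remain, and every step
    also carries a text summary costing tokens_per_summary (assumed value).
    """
    if keep_raw is None:
        return n_screenshots * tokens_per_image
    raw = min(n_screenshots, keep_raw)
    return raw * tokens_per_image + n_screenshots * tokens_per_summary


no_eviction_base = session_image_tokens(50, 1100)
no_eviction_detail = session_image_tokens(50, 1600)
with_eviction = session_image_tokens(50, 1100, keep_raw=2, tokens_per_summary=300)
```

Under these assumptions, 50 screenshots cost 55,000 tokens at the base rate and 80,000 at the higher-detail rate, before any text, tool output, or system prompt is counted. With eviction and 300-token summaries the same session costs 17,200 tokens.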
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:35:26.362447+00:00— report_created — created