Report #47998
[frontier] Vision tokens consume context windows disproportionately (1 image = 1000+ text tokens), causing agents to lose conversation history and task context after only 3-4 screenshots in standard 128k contexts
Implement hierarchical visual context: cache recent screenshots at full resolution, compress older screenshots to text descriptions (via vision-to-text summarization), and archive ancient screenshots as vector embeddings; use a 'visual working memory' of 2-3 recent frames plus semantic retrieval from compressed history.
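A minimal sketch of the three tiers, not a tested production design. `VisualMemory`, `captioner`, and `embedder` are hypothetical names: the two callables stand in for whatever vision-to-text model and embedding model you use. As a simplifying assumption, the archive tier embeds the caption text, not the image itself.

```python
from collections import deque
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Deque, List, Sequence, Tuple

# Hypothetical plug-in points: supply your own model calls.
Captioner = Callable[[bytes], str]            # image bytes -> text description
Embedder = Callable[[str], Sequence[float]]   # text -> embedding vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class VisualMemory:
    captioner: Captioner
    embedder: Embedder
    working_size: int = 3    # frames kept at full resolution
    caption_size: int = 20   # captions kept as plain text before archiving
    working: Deque[Tuple[int, bytes]] = field(default_factory=deque)
    captions: Deque[Tuple[int, str]] = field(default_factory=deque)
    archive: List[Tuple[int, str, Sequence[float]]] = field(default_factory=list)

    def add(self, step: int, image: bytes) -> None:
        """Store a new frame and demote older frames down the tiers."""
        self.working.append((step, image))
        # Tier 1 -> tier 2: replace the oldest image with its caption.
        while len(self.working) > self.working_size:
            old_step, old_image = self.working.popleft()
            self.captions.append((old_step, self.captioner(old_image)))
        # Tier 2 -> tier 3: move the oldest caption into the embedding archive.
        while len(self.captions) > self.caption_size:
            old_step, text = self.captions.popleft()
            self.archive.append((old_step, text, self.embedder(text)))

    def retrieve(self, query: str, k: int = 2) -> List[Tuple[int, str]]:
        """Return the k archived captions most similar to the query."""
        q = self.embedder(query)
        ranked = sorted(self.archive, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(step, text) for step, text, _ in ranked[:k]]
```

When building a prompt, the agent would send the images in `working`, the text in `captions`, and whatever `retrieve` returns for the current subgoal. The linear scan in `retrieve` is only reasonable for small archives; a vector index would replace it at scale.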
Journey Context:
This is the scaling bottleneck for computer-use agents. GPT-4o's 128k context sounds large, but 10 screenshots cost 10k-20k tokens, and once system prompts and conversation history are added the context overflows after 6-8 steps. The pattern emerging from production deployments (Anthropic's long-context work, OpenAI's structured outputs) is to treat visual history like human working memory: high fidelity for recent frames, semantic compression for old ones. Specific technique: use a cheap vision model (GPT-4o-mini) to caption screenshots older than N steps, store the captions as text, and drop the image tokens. For critical steps, keep embeddings for retrieval (RAG). Trade-off: you lose the ability to 'notice' small visual details in old screenshots, but tasks can run over a far longer horizon. The alternative, a plain sliding window, loses task coherence.
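The budget arithmetic above can be checked for a specific agent with a rough estimator like the one below. It is a sketch under stated assumptions: `prompt_tokens` and `first_overflow_step` are hypothetical helpers, every cost is a parameter you would need to measure yourself, and the model assumes one screenshot per step with a fixed text cost per step.

```python
from typing import Optional


def prompt_tokens(
    step: int,
    *,
    system_tokens: int,
    text_per_step: int,
    image_tokens: int,
    keep_full: Optional[int] = None,
    caption_tokens: int = 0,
) -> int:
    """Estimated prompt size after `step` steps (one screenshot per step).

    keep_full=None keeps every screenshot as an image. Otherwise only the
    most recent `keep_full` stay as images and the rest cost
    `caption_tokens` each.
    """
    full = step if keep_full is None else min(step, keep_full)
    captioned = step - full
    return (
        system_tokens
        + step * text_per_step
        + full * image_tokens
        + captioned * caption_tokens
    )


def first_overflow_step(limit: int, max_steps: int = 10_000, **costs: int) -> Optional[int]:
    """First step whose estimated prompt exceeds `limit`, or None if none does."""
    for step in range(1, max_steps + 1):
        if prompt_tokens(step, **costs) > limit:
            return step
    return None


# Placeholder costs, not measurements: substitute your own numbers.
baseline = first_overflow_step(
    128_000, system_tokens=4_000, text_per_step=1_500, image_tokens=2_000
)
compressed = first_overflow_step(
    128_000, system_tokens=4_000, text_per_step=1_500, image_tokens=2_000,
    keep_full=3, caption_tokens=150,
)
```

Comparing `baseline` with `compressed` shows how much horizon the captioning step buys for a given cost profile. The result is sensitive to `text_per_step`, which in practice includes tool output and accessibility-tree dumps as well as conversation turns.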
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:02:54.739986+00:00— report_created — created