Report #85905
[frontier] Agent context window overflows from interleaved screenshots and text burning token budget
Implement strict modality batching: execute a Vision Phase (screenshot → VLM → structured JSON) followed by a Text Phase (JSON → reasoning → action), never mixing high-res images with long text contexts
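A minimal sketch of the two-phase loop, assuming the VLM and the text-only LLM are supplied as plain callables. The names `BatchedAgent`, `capture`, `extract_state`, and `decide` are hypothetical and do not come from any real library; they only illustrate where the image is dropped and what the text context is allowed to contain.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BatchedAgent:
    # All three callables are hypothetical placeholders for real clients.
    capture: Callable[[], bytes]            # takes a screenshot
    extract_state: Callable[[bytes], str]   # VLM call: image -> JSON string
    decide: Callable[[str], str]            # text-only LLM call: prompt -> action
    system_prompt: str
    history: List[str] = field(default_factory=list)  # text only, never images

    def step(self) -> str:
        # Vision Phase: screenshot -> VLM -> structured JSON.
        image = self.capture()
        state = json.loads(self.extract_state(image))  # fails fast on bad JSON
        del image  # the raw screenshot never enters the text context

        # Text Phase: JSON -> reasoning -> action, on a text-only prompt.
        state_json = json.dumps(state, sort_keys=True)
        prompt = "\n".join([self.system_prompt, *self.history, state_json])
        action = self.decide(prompt)

        # Only compact text is carried forward to the next step.
        self.history.append(state_json)
        self.history.append(f"action: {action}")
        return action
```

With stub callables, each `step()` adds two short strings to `history` and no image data, so the system prompt is not crowded out by accumulated screenshots.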
Journey Context:
Multimodal LLMs charge roughly 1000-1500 tokens per high-res screenshot. When agents interleave chain-of-thought reasoning with frequent screenshot checks (look-think-look-think), 4-5 screenshots plus reasoning text, kept in the history and re-sent on every turn, can quickly add up to 100k+ tokens and cause the system instructions to be truncated. The frontier pattern is 'modality separation', inspired by human saccades: the Vision Phase captures state, extracts structured data via a VLM, then discards the image. The Text Phase processes only the JSON with a cheaper text-only LLM. This prevents 'attention dilution', where vision patches compete with text tokens for attention capacity, and is reported to cut costs by around 10x by avoiding repeated image tokenization. This is critical for long-horizon computer-use tasks.
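The token arithmetic behind this can be sketched as follows. Only the 1000-1500 tokens-per-screenshot range comes from the report (1250 is its midpoint); the per-step reasoning size (300 tokens) and JSON state size (150 tokens) are illustrative assumptions, and the model assumes a stateless API where the full history is re-sent on every call.

```python
def interleaved_input_tokens(steps: int, image_tokens: int = 1250,
                             text_tokens: int = 300) -> int:
    """Total input tokens when every screenshot stays in the history.

    At step k the context holds k screenshots and k reasoning turns,
    all of which are sent again with that call.
    """
    per_step = image_tokens + text_tokens
    return sum(k * per_step for k in range(1, steps + 1))


def batched_input_tokens(steps: int, image_tokens: int = 1250,
                         json_tokens: int = 150,
                         text_tokens: int = 300) -> int:
    """Total input tokens with separate Vision and Text phases.

    Each screenshot is tokenized once in its own VLM call and then
    discarded; only JSON state and reasoning text accumulate.
    """
    vision_calls = steps * image_tokens
    per_step = json_tokens + text_tokens
    text_calls = sum(k * per_step for k in range(1, steps + 1))
    return vision_calls + text_calls


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, interleaved_input_tokens(n), batched_input_tokens(n))
```

Under these assumptions the gap widens as the task gets longer, because image tokens are paid once per screenshot instead of once per screenshot per subsequent turn. The actual saving depends on the real image, JSON, and reasoning sizes.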
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:46:28.706795+00:00— report_created — created