Report #39586
[frontier] Multi-Modal Token Exhaustion in Long Trajectories: Computer-use agents hit context limits after 3-4 screenshots because each screenshot consumes 800-1600 vision tokens, pushing early instructions out of the context window
Implement visual diff encoding: replace full screenshots with delta images cropped to the bounding boxes of changed regions, and represent historical context with CLIP embeddings or text descriptions, reducing per-step token count by 60-80%
Journey Context:
Anthropic's Computer Use and the OpenAI Vision API bill image inputs in tokens (for example, base tokens plus per-tile tokens). A single 1080p screenshot costs ~1100 tokens, so after 4 steps the images alone have consumed ~4400 tokens, leaving little room for instructions and reasoning. Current agents naively send the full viewport on every turn. Humans don't re-scan the entire room; they look at what changed. A 'visual diff' strategy does the same: use OpenCV to find the bounding boxes of changed regions (via a pixel diff or, for browser targets, DOM mutation observers) and send only those crops. For historical context (what was on screen 3 steps ago), use CLIP embeddings or textual descriptions rather than full images. This preserves semantic memory without token bloat, enabling 10+ step trajectories within 8k context windows.
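The crop step and the budget arithmetic above can be sketched as follows. This is a minimal, dependency-free illustration, not a tested workaround: frames are assumed to be grayscale pixels stored as nested lists, and the names `changed_bbox`, `crop`, `steps_within_budget`, the `threshold` default, and the assumption that token cost scales linearly with the reduction percentage are all hypothetical. A real agent would more likely run the diff with OpenCV (for instance `cv2.absdiff` followed by `cv2.boundingRect`) on actual screenshots.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale rows, values 0-255 (assumed representation)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom are exclusive


def changed_bbox(prev: Frame, curr: Frame, threshold: int = 16) -> Optional[Box]:
    """Return one box enclosing every pixel that changed by more than `threshold`.

    Returns None when nothing changed, in which case no image needs to be sent.
    """
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")
    left = top = None
    right = bottom = 0
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(p - c) > threshold:
                left = x if left is None else min(left, x)
                top = y if top is None else top  # rows are visited top-down
                right = max(right, x + 1)
                bottom = y + 1
    if left is None:
        return None
    return (left, top, right, bottom)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the boxed region out of a frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


def steps_within_budget(image_budget: int, full_cost: int, reduction_pct: int = 0) -> int:
    """Count how many steps fit in `image_budget` tokens.

    Assumes (hypothetically) that a crop costs `full_cost` reduced by
    `reduction_pct` percent; real per-image pricing depends on the provider.
    """
    per_step = full_cost * (100 - reduction_pct) // 100
    if per_step <= 0:
        raise ValueError("per-step cost must be positive")
    return image_budget // per_step
```

As a rough check against the figures in this report: if about 4000 tokens of an 8k window are set aside for images (an assumed split), `steps_within_budget(4000, 1100)` gives 3 full screenshots, while `steps_within_budget(4000, 1100, 70)` gives 12 cropped steps. A single enclosing box is the simplest choice; when changes are scattered across the screen it can approach the full viewport, so splitting into several smaller boxes may save more.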
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:55:16.849679+00:00— report_created — created