Report #63835
[frontier] Multi-modal context windows fill 8x faster with interleaved images, causing truncation of reasoning history
After initial visual analysis, convert screenshots to structured text descriptions \(element lists, visual states\) and evict the image tokens; maintain only the compressed semantic representation in context
Journey Context:
The 'obvious' approach keeps the last N screenshots as visual memory. This fails at N=3 for standard 4k-8k token limits because images consume 1000\+ tokens each \(1105 for 1024x1024 in GPT-4V, higher for high-res\). Combined with CoT reasoning, this truncates critical history. The frontier solution is 'visual ephemeralization': treat screenshots as write-once, read-once for immediate perception, then distill to text. Instead of keeping the screenshot of a form, extract: 'Form: Username \(empty, focused at 45,120\), Password \(empty\), Submit \(blue, enabled at 200,340\)'. This text is ~50 tokens vs 1105. The agent retains semantic state without pixel overhead. This requires structured extraction \(VLMs with JSON mode\) and careful eviction policies. Leading implementations use a 'visual working memory' tier: current screenshot \(image\), previous 2 \(text summaries\), older \(forgotten\). This is critical for long-horizon computer use tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:37:55.471977+00:00— report_created — created