Report #68900
[frontier] Multi-Modal Token Exhaustion: vision inputs consume 1000+ tokens per screenshot, leaving insufficient context for task history and instructions
Implement dynamic resolution switching (low-res for UI detection, high-res only for OCR) and visual summary caching (replace pixel data with text descriptions after N steps)
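A minimal sketch of how the two mitigations might combine when assembling the history for a request. It assumes the OpenAI chat-style content parts (`image_url` with a `detail` field); the `Step` class, the `needs_ocr` flag, the `summary` field, and `build_history` are hypothetical names introduced here for illustration, and the summary text is assumed to come from a separate captioning call that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One agent step: a screenshot plus an optional cached text summary."""
    image_url: str                 # URL or data: URI of the screenshot
    needs_ocr: bool = False        # hypothetical flag set by the agent's planner
    summary: Optional[str] = None  # hypothetical: filled in by a captioning call


def build_history(steps: list[Step], keep_images: int = 2) -> list[dict]:
    """Build message content parts for the screenshot history.

    Only the newest `keep_images` steps keep their pixels. Older steps are
    replaced by their text summary when one is cached; if no summary exists
    yet, the image is kept so no information is silently lost.
    """
    parts: list[dict] = []
    cutoff = len(steps) - keep_images
    for i, step in enumerate(steps):
        if i < cutoff and step.summary is not None:
            # Visual summary caching: text stands in for the dropped image.
            parts.append(
                {"type": "text", "text": f"[step {i} screenshot] {step.summary}"}
            )
        else:
            # Dynamic resolution switching: pay for "high" only when OCR is needed.
            detail = "high" if step.needs_ocr else "low"
            parts.append(
                {
                    "type": "image_url",
                    "image_url": {"url": step.image_url, "detail": detail},
                }
            )
    return parts
```

The value of `keep_images` plays the role of N: a larger value preserves more pixel-exact context at a higher token cost.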
Journey Context:
GPT-4V and Claude 3 consume large token counts for high-res images. Under OpenAI's documented calculation, for example, a 2048x1536 image in 'high' detail costs 765 tokens in total: an 85-token base plus 4 tiles at 170 tokens each. In multi-step tasks, sending the full history of screenshots quickly exhausts a 128k context window. Naive fix: compress images to JPEG (insufficient, still ~300 tokens). Better: use 'low' detail mode for most steps and 'high' only when OCR is needed. Advanced: after step N, generate a text description of the screenshot content, drop the image from history, and keep the description (text is ~20 tokens vs 500). This trades exact pixel precision for context continuity. Provenance ties to OpenAI's vision documentation on the detail parameter and token calculation.
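The token arithmetic can be sketched as follows. This follows the tile-based rule in OpenAI's vision documentation for GPT-4-class vision models (85-token base, 170 tokens per 512px tile); other models and providers use different rules, so treat the numbers as estimates. The function name `image_tokens` is hypothetical, and the sketch assumes images are only ever scaled down, never up.

```python
import math

BASE_TOKENS = 85   # flat cost per image; the whole cost in "low" detail
TILE_TOKENS = 170  # cost per 512x512 tile in "high" detail


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the tile-based rule."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: scale down to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down so the shortest side is at most 768px
    # (assumption: smaller images are left as they are).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


print(image_tokens(2048, 1536, "high"))  # 765
print(image_tokens(2048, 1536, "low"))   # 85
```

Under these assumptions, a 50-step history of 2048x1536 screenshots costs 50 x 765 = 38,250 tokens in 'high' detail versus 50 x 85 = 4,250 in 'low', before any images are replaced by text descriptions.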
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:07:50.017127+00:00— report_created — created