Report #38385
[frontier] Context window exhaustion and cost explosion when interleaving vision and text reasoning calls
Batch all vision operations (screenshots, image analysis) into a single phase that extracts text descriptions, then switch to text-only reasoning for planning. Never interleave vision calls with text reasoning steps.
Journey Context:
Each vision call consumes roughly 1k-4k tokens, so a 'screenshot → think → screenshot → think' loop quickly exhausts a 128k-200k context window on long tasks. The common mistake is treating vision as just another tool call. Leading practitioners now use 'modality epochs': a Vision Epoch (capture all needed visual state, run OCR, write descriptions), then a Text Epoch (plan, reason, decide), then an Action Epoch (execute clicks and typing). This minimizes modality-switching costs and keeps the context window dominated by compact text rather than heavy base64 images. It also prevents the 'token cliff', where an agent suddenly loses its context history halfway through a task.
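The epoch ordering and its effect on context size can be sketched in Python as below. This is a minimal illustration, not a reference implementation: `describe`, `plan`, and `act` are hypothetical callbacks standing in for whatever vision model, planner, and UI driver the agent uses, and the per-item token figures are assumed values chosen only to make the arithmetic concrete. The sketch also assumes each image is described in its own short-lived call, so only the text description is carried into the planning context.

```python
from typing import Callable, List, Sequence

# Illustrative figures only; real costs depend on the model and image size.
IMAGE_TOKENS = 2_500        # mid-range of the 1k-4k per vision call cited above
DESCRIPTION_TOKENS = 150    # hypothetical size of one text description
THOUGHT_TOKENS = 300        # hypothetical size of one reasoning step


def interleaved_context_tokens(num_screens: int) -> int:
    """Tokens held in history when every image stays next to its reasoning step."""
    return num_screens * (IMAGE_TOKENS + THOUGHT_TOKENS)


def epoch_context_tokens(num_screens: int) -> int:
    """Tokens held in history when only text descriptions are carried forward."""
    return num_screens * (DESCRIPTION_TOKENS + THOUGHT_TOKENS)


def run_epochs(
    screens: Sequence[bytes],
    describe: Callable[[bytes], str],
    plan: Callable[[List[str]], List[str]],
    act: Callable[[str], str],
) -> List[str]:
    """Run the three epochs in order; all callbacks are hypothetical stand-ins."""
    # Vision Epoch: describe every image up front and keep only the text.
    descriptions = [describe(image) for image in screens]
    # Text Epoch: plan over the compact descriptions, with no images in context.
    actions = plan(descriptions)
    # Action Epoch: execute the planned steps in order.
    return [act(action) for action in actions]


if __name__ == "__main__":
    print(interleaved_context_tokens(40))  # 112000
    print(epoch_context_tokens(40))        # 18000
```

Under these assumed figures, 40 screens held inline occupy 112,000 tokens of history, which is most of a 128k window, while the same 40 screens carried as descriptions occupy 18,000.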
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:54:16.154437+00:00 — report_created — created