Report #48863
[frontier] Agent context window exhausted after 3-4 screenshots despite short text prompts, causing truncated history and failure in multi-step tasks
Implement adaptive patch budgeting: calculate image tokens pre-flight using the encoder-specific tile or patch grid (e.g., GPT-4o in 'high' detail: after the API's internal downscale, ceil(width/512)*ceil(height/512)*170+85 tokens; 'low' detail is a flat 85 tokens), resize screenshots to keep the visual token count under 25% of the total context window, and use 'low' detail mode for navigation screenshots, reserving 'high' detail for OCR-critical regions.
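A minimal sketch of the pre-flight calculation, assuming OpenAI's published tiling rule for GPT-4o (fit within 2048x2048, shrink the shortest side to 768px, then 170 tokens per 512px tile plus 85 base tokens). The function names, the `needs_ocr` flag, and the decision never to upscale small images are illustrative assumptions, not part of any library; verify the constants against current documentation before relying on them.

```python
import math

# Assumed constants from OpenAI's published GPT-4o image pricing rule.
BASE_TOKENS = 85        # flat cost, also the full cost of 'low' detail
TOKENS_PER_TILE = 170   # cost of each 512px tile in 'high' detail
TILE_PX = 512


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate visual tokens for one image before sending it."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: fit inside a 2048x2048 square, keeping aspect ratio.
    long_scale = min(1.0, 2048 / max(width, height))
    w, h = width * long_scale, height * long_scale
    # Step 2: shrink so the shortest side is at most 768px.
    # Assumption: small images are not upscaled.
    short_scale = min(1.0, 768 / min(w, h))
    w, h = max(1, round(w * short_scale)), max(1, round(h * short_scale))
    # Step 3: count 512px tiles covering the scaled image.
    tiles = math.ceil(w / TILE_PX) * math.ceil(h / TILE_PX)
    return tiles * TOKENS_PER_TILE + BASE_TOKENS


def choose_detail(width: int, height: int, used_visual_tokens: int,
                  context_window: int, needs_ocr: bool,
                  visual_fraction: float = 0.25) -> str:
    """Pick 'high' only for OCR-critical shots that still fit the visual budget."""
    budget = int(context_window * visual_fraction)
    high_cost = estimate_image_tokens(width, height, "high")
    if needs_ocr and used_visual_tokens + high_cost <= budget:
        return "high"
    return "low"


if __name__ == "__main__":
    print(estimate_image_tokens(1920, 1080, "high"))
    print(choose_detail(1920, 1080, used_visual_tokens=0,
                        context_window=128_000, needs_ocr=False))
```

Tracking `used_visual_tokens` across the whole conversation, rather than per request, is what keeps accumulated screenshots from crowding out the text history.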
Journey Context:
Developers often assume they can 'just send the screenshot' without realizing that a 1920x1080 screen at native resolution consumes roughly 1500-3000 tokens, depending on the vision encoder's patch size. A common failure mode is sending 4K captures to 'preserve clarity' when the VLM's vision encoder downscales to 336px or 512px inputs anyway, so the extra pixels are spent on imperceptible noise. The 25% threshold comes from empirical observation: beyond it, text reasoning quality degrades because visual tokens compete with text for attention. This pattern distinguishes working computer-use agents from demo prototypes that break on real tasks.
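To illustrate why oversized captures buy nothing, the sketch below computes the size an image would have after an encoder-side downscale, so the client can resize to that size before upload. The default limits (2048px long side, 768px short side) follow the GPT-4o-style rule and are assumptions; other encoders use different limits, so treat `max_long_side` and `max_short_side` as placeholders to tune per model.

```python
def downscale_to_fit(width: int, height: int,
                     max_long_side: int = 2048,
                     max_short_side: int = 768) -> tuple[int, int]:
    """Return the (width, height) left after a two-step downscale.

    Never upscales: images already within both limits are returned unchanged.
    """
    # Step 1: cap the longest side.
    scale = min(1.0, max_long_side / max(width, height))
    # Step 2: cap the shortest side, measured after step 1.
    scale *= min(1.0, max_short_side / (min(width, height) * scale))
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # Under these assumed limits, a 4K capture and a 1080p capture
    # end up at the same effective resolution.
    print(downscale_to_fit(3840, 2160))
    print(downscale_to_fit(1920, 1080))
```

Resizing client-side to the returned dimensions keeps upload size down and makes the pre-flight token estimate match what the encoder actually processes.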
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:30:05.424071+00:00— report_created — created