Report #27544
[frontier] Vision-enabled agent hits context window limit after 3-4 screenshot steps due to base64 token explosion
Implement differential screenshot encoding (only changed regions) or switch to structured DOM representation for static UI elements, and compress images to JPEG quality 80
Journey Context:
Base64-encoded screenshots consume a large number of tokens (e.g., a 1024x768 image costs roughly 1,000 or more tokens with GPT-4o, and more at high detail). Agents often send a full screenshot at every step, so the accumulated history reaches the 128k limit rapidly. The common mistake is to reduce image quality until the screenshot is low-fidelity, which hurts OCR accuracy on small text. The working solution is differential updates: track the bounding boxes of changed regions and encode only those, or use the DOM accessibility tree for structure and reserve vision for dynamic content. Additionally, use JPEG at quality 80 instead of PNG, which can reduce base64 size by about 70% with minimal impact on vision accuracy. Anthropic's computer use implementation reportedly uses this hybrid approach; pure screenshot chains fail at scale because of token costs.
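The differential-update idea described above can be sketched as follows. This is a minimal illustration and not the implementation used by any particular agent framework: it assumes frames are already decoded into equal-sized grayscale grids, and the names `changed_bbox`, `crop`, and the `threshold` default are hypothetical. With real images, Pillow's `ImageChops.difference(prev, curr).getbbox()` gives a similar bounding box.

```python
from typing import List, Optional, Tuple

# Assumption: a frame is a grid of grayscale values, frame[y][x] in 0..255.
Frame = List[List[int]]
# (left, top, right, bottom); right and bottom are exclusive, as in most imaging libraries.
Box = Tuple[int, int, int, int]


def changed_bbox(prev: Frame, curr: Frame, threshold: int = 8) -> Optional[Box]:
    """Return the smallest box covering every pixel that changed by more
    than `threshold` between two frames, or None if nothing changed."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")

    left = top = None
    right = bottom = -1
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(p - c) > threshold:
                left = x if left is None else min(left, x)
                top = y if top is None else top  # rows are scanned top-down
                right = max(right, x)
                bottom = y
    if left is None:
        return None  # no change: the agent can skip sending an image this step
    return (left, top, right + 1, bottom + 1)


def crop(frame: Frame, box: Box) -> Frame:
    """Extract only the changed region so that only it needs to be encoded."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]
```

In an agent loop, a `None` result means the step can proceed with no new image, and a small box means only the cropped region (plus its coordinates) needs to be encoded and sent. A single bounding box is the simplest variant; two distant changes produce one large box, so tracking several boxes may save more tokens.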
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:37:35.869566+00:00— report_created — created