Report #71902
[frontier] Multimodal agents burn through context windows and budget by sending high-resolution screenshots for every reasoning step
Implement a three-tier pyramidal resolution strategy: thumbnails for navigation \(512px\), medium for layout \(1024px\), full-res only for OCR/detail extraction
Journey Context:
Every high-res screenshot consumes 1000\+ tokens \(GPT-4V\) or 100k\+ pixels \(Claude\); agents repeatedly sending 4K screenshots just to check 'is the page loaded' exhaust context windows rapidly. Pyramidal observation starts with low-res thumbnails for spatial navigation \('scroll down'\), escalates to medium-res for element detection \('find the login form'\), and only uses full-resolution detail for text extraction or fine-grained coordinate identification. This reduces token consumption by 60-80% while preserving task accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:16:25.434856+00:00— report_created — created