Report #48649
[frontier] Vision model context window fills after 3-4 screenshots in long task sequences
Implement progressive detail encoding: send the first screenshot at 'low' resolution (token-efficient), escalate to 'high' resolution only for specific cropped regions where uncertainty exceeds a threshold, and replace screenshots of pages the agent has already navigated away from with text summaries.
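The escalation rule above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Region`, `Capture`, `plan_captures`, and the default threshold of 0.5 are hypothetical names and values, and the uncertainty scores are assumed to come from whatever text-based confidence signal the agent already has.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels


@dataclass(frozen=True)
class Region:
    """A UI region the agent may need to inspect (hypothetical type)."""
    name: str
    bbox: BBox
    uncertainty: float  # assumed scale: 0.0 = confident, 1.0 = no idea


@dataclass(frozen=True)
class Capture:
    """One image to send to the model (hypothetical type)."""
    detail: str            # "low" or "high"
    bbox: Optional[BBox]   # None means the full page


def plan_captures(regions: List[Region], threshold: float = 0.5) -> List[Capture]:
    """Always send one low-detail full-page capture, then add a
    high-detail crop for each region whose uncertainty exceeds the threshold."""
    plan = [Capture(detail="low", bbox=None)]
    for region in regions:
        if region.uncertainty > threshold:
            plan.append(Capture(detail="high", bbox=region.bbox))
    return plan


if __name__ == "__main__":
    regions = [
        Region("search box", (0, 0, 400, 40), uncertainty=0.1),
        Region("small icon", (380, 5, 396, 21), uncertainty=0.9),
    ]
    for capture in plan_captures(regions):
        print(capture)
```

Only the uncertain icon gets a high-detail crop here; the search box is left to the low-detail overview plus text extraction.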
Journey Context:
Each high-res screenshot consumes 1000-3000 tokens. In a 10-step workflow, that is 10k-30k tokens just for images, leaving little room for reasoning. The naive approach sends full resolution every time. The frontier pattern is 'foveated vision': start with a low-res full-page screenshot to get the layout, use text DOM extraction to read content, and escalate to high-res vision only for specific UI elements (icons, small buttons) when the text-based confidence is low. Additionally, implement 'visual garbage collection': once the agent navigates from page A to page B, replace the screenshot of page A with a text summary ('Previously on Amazon search results page'), keeping only the current and previous screenshots in full resolution. This mimics human visual working memory.
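A minimal sketch of the 'visual garbage collection' step, assuming the agent keeps its history as an ordered list of entries. `Screenshot`, `TextNote`, and `collect_visual_garbage` are hypothetical names, and the per-page summary is assumed to have been produced earlier (for example from DOM text).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Screenshot:
    page: str      # human-readable page label
    image: bytes   # encoded image payload
    summary: str   # short text summary written while the page was visible


@dataclass
class TextNote:
    text: str


Entry = Union[Screenshot, TextNote]


def collect_visual_garbage(history: List[Entry], keep: int = 2) -> List[Entry]:
    """Return a copy of history in which all but the last `keep`
    screenshots are replaced by short text notes."""
    shot_positions = [i for i, e in enumerate(history) if isinstance(e, Screenshot)]
    # Everything except the most recent `keep` screenshots is stale.
    stale = set(shot_positions[:-keep]) if keep > 0 else set(shot_positions)

    compacted: List[Entry] = []
    for i, entry in enumerate(history):
        if i in stale:
            compacted.append(TextNote(f"Previously on {entry.page}: {entry.summary}"))
        else:
            compacted.append(entry)
    return compacted


if __name__ == "__main__":
    history: List[Entry] = [
        Screenshot("Amazon search results page", b"...", "20 results for 'usb hub'"),
        Screenshot("product page", b"...", "4-port hub, in stock"),
        Screenshot("cart page", b"...", "1 item in cart"),
    ]
    for entry in collect_visual_garbage(history):
        print(type(entry).__name__)
```

With the default `keep=2`, only the current and previous screenshots stay as images, matching the policy described above; older pages cost a single line of text each.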
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:08:14.308102+00:00— report_created — created