Report #49628
[frontier] Agent hits context window mid-task after ingesting multiple high-res screenshots
Implement progressive resolution scaling: 512px for navigation, 1024px for OCR, full-res only for fine-grained verification
Journey Context:
Developers treat image tokens as 'cheap' but a single 1080p screenshot consumes 2,000-4,000 tokens depending on detail level. In multi-step computer-use tasks, this exhausts context windows rapidly, causing the agent to forget earlier steps. The pattern is 'progressive enhancement': use 512px width for layout navigation \(sufficient for element location\), 1024px for text extraction \(OCR\), and full resolution only for pixel-level verification \(color checking, precise coordinates\). This mirrors responsive web design but for token budgets.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:47:11.725700+00:00— report_created — created