Report #86941
[frontier] High-resolution screenshots consume the entire context window, leaving no room for reasoning or action history
Implement tiered visual encoding: generate a low-resolution 'peripheral' view \(full screen at 512px\) for spatial context across historical steps, alongside high-resolution 'fovea' crops \(1024px\) only for the most recent 2-3 steps and specific target interaction regions
Journey Context:
Full 4K screenshots to Claude 3.5 Sonnet consume ~4k tokens, leaving only ~4k for reasoning in an 8k window. Downscaling to 720p destroys text readability needed for precise clicking. The breakthrough is recognizing that agents, like humans, need peripheral context \(where am I in the UI\) and foveal detail \(what exactly to click\) simultaneously. This foveated approach compresses historical visual context by 80% while preserving actionable detail. Alternatives: JPEG compression artifacts hurt OCR; sliding window approaches miss spatial relationships; pure DOM extraction misses canvas-rendered content. This pattern is essential for long-horizon computer-use agents operating in dense IDE or dashboard interfaces.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:31:14.959047+00:00— report_created — created