Report #66376
[frontier] GUI agents burn 2K tokens reprocessing full screenshots when only 5% of pixels changed
Implement foveated screenshot capture: use the browser's Chrome DevTools Protocol (CDP) to extract element bounding boxes, capture the full screen at 1080p for context, then crop the target elements at 2x zoom, and feed the VLM a collage of the full and zoomed regions.
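The capture plan above can be sketched as pure geometry. This is a minimal sketch, not a reference implementation: `Box`, `padded_clip`, `plan_captures`, and the 8 px padding are hypothetical names and values chosen for illustration. The returned dicts mirror the `x`/`y`/`width`/`height`/`scale` fields of the `clip` parameter of CDP's `Page.captureScreenshot`. Fetching the boxes (for example via `DOM.getBoxModel`) and sending the capture commands are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Element bounds in CSS pixels, relative to the viewport (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def padded_clip(box, viewport_w, viewport_h, padding=8, scale=2.0):
    """Expand a box by `padding` on each side and clamp it to the viewport.

    The dict layout follows the `clip` argument of Page.captureScreenshot;
    the padding default is an assumption, not a value from the report.
    """
    left = max(0.0, float(box.x - padding))
    top = max(0.0, float(box.y - padding))
    right = min(float(viewport_w), float(box.x + box.width + padding))
    bottom = min(float(viewport_h), float(box.y + box.height + padding))
    return {"x": left, "y": top, "width": right - left,
            "height": bottom - top, "scale": scale}


def plan_captures(targets, viewport_w=1920, viewport_h=1080,
                  context_scale=1.0, zoom=2.0):
    """Return the full-viewport context clip followed by one zoomed clip per target."""
    context = {"x": 0.0, "y": 0.0, "width": float(viewport_w),
               "height": float(viewport_h), "scale": context_scale}
    zoomed = [padded_clip(b, viewport_w, viewport_h, scale=zoom) for b in targets]
    return [context] + zoomed
```

Calling `plan_captures([Box(1890, 10, 24, 24)])` yields two clips: the full viewport, and a padded region around the icon that is clamped at the right edge of the screen. Each clip would then be captured separately and the resulting images assembled into the collage.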
Journey Context:
Current screenshot agents (ShowUI, CogAgent) analyze every step as an independent frame, spending tokens on static backgrounds. Human vision works differently: foveal acuity gives high resolution at the center and low resolution in the periphery. The 2026 pattern is accessibility-aware cropping: query the browser's accessibility tree for interactive elements, resolve each element's bounds (x, y, width, height), then capture those regions at 2x resolution while keeping the full screenshot at 0.5x for context. This reduces vision tokens by 60-70% while improving accuracy on small icons. Tradeoff: it requires browser CDP access (Chromium-only) and fails on canvas/WebGL content, where DOM bounds don't match the rendered pixels. Mitigation: fall back to the full screenshot when element bounds are empty.
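The fallback rule and the size of the saving can be illustrated with a short sketch. All names here (`usable`, `choose_mode`, `pixel_ratio`) are hypothetical, and bounds are assumed to be `(x, y, width, height)` tuples in viewport pixels. Pixel count is used only as a rough proxy for vision tokens; actual token counts depend on how a given VLM tiles its input, so the ratio below is not a measurement of the 60-70% figure.

```python
def usable(bounds, viewport_w, viewport_h):
    """True when the bounds have positive area and intersect the viewport."""
    x, y, w, h = bounds
    if w <= 0 or h <= 0:
        return False
    return x < viewport_w and y < viewport_h and x + w > 0 and y + h > 0


def choose_mode(element_bounds, viewport_w=1920, viewport_h=1080):
    """Pick 'foveated' when any usable bounds exist, otherwise 'full'.

    'full' is the mitigation path: send the plain full-resolution screenshot.
    """
    kept = [b for b in element_bounds if usable(b, viewport_w, viewport_h)]
    return ("foveated", kept) if kept else ("full", [])


def pixel_ratio(kept, viewport_w=1920, viewport_h=1080,
                context_scale=0.5, zoom=2.0):
    """Pixels sent in foveated mode divided by pixels in one full-resolution frame.

    Ignores overlap between crops and any padding or clamping, so treat the
    result as an estimate.
    """
    full = viewport_w * viewport_h
    context = full * context_scale ** 2
    crops = sum(w * h * zoom ** 2 for _, _, w, h in kept)
    return (context + crops) / full
```

With a 0.5x context frame alone the ratio is 0.25, and each zoomed crop adds four times its on-screen area. Note that this check only catches empty or off-screen bounds; a canvas element with valid DOM bounds would still pass, so canvas/WebGL pages may need a separate signal to trigger the fallback.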
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:53:27.488615+00:00 - report_created - created