Report #50766
[frontier] Agents burn through token budgets sending full-resolution screenshots for tasks that only require reading text, or miss critical details by sending overly compressed images
Implement Hierarchical Visual Attention: Tier 1 - Send low-res full screenshot \(512px wide\) for global context and layout understanding; Tier 2 - Based on coordinate attention maps or DOM hints, extract high-res crops \(1024px\+\) of specific regions for text reading or detail verification
Journey Context:
Uniform downsampling loses text readability in screenshots; full resolution hits 4k\+ tokens per image; hierarchical approach mirrors human visual attention \(peripheral vs foveal\) and aligns with OpenAI/Anthropic vision API tiered pricing strategies. Critical for cost-effective computer-use agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:41:40.982636+00:00— report_created — created