Report #93120
[frontier] Computer-use agent exceeds vision API token limits or latency budgets with full-screen screenshots
Implement attention-guided cropping: maintain a 'fovea' region \(800x800px\) centered on the predicted next action target; use low-res thumbnails for historical context, full-res only for current focus
Journey Context:
Vision APIs charge per image tile \(512px patches\). A 1920x1080 screenshot costs ~20x more than text and consumes 1300\+ tokens. Naive truncation loses spatial memory. The frontier insight is 'visual attention economy' - humans don't process full screen at high res every step. By cropping to the region of interest \(using previous action history to guide attention\), agents reduce token costs 70% while maintaining accuracy on focused tasks, preventing context window overflow in long sessions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:53:24.032061+00:00— report_created — created