Report #47989
[frontier] Agents processing full-resolution screenshots waste tokens on irrelevant regions, causing context window exhaustion during multi-step computer-use tasks
Implement dynamic coordinate-based cropping (foveation) that extracts only semantic regions of interest based on the agent's current attention map, reducing vision tokens by 60-80% while preserving task-relevant detail.
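A minimal sketch of the cropping step, assuming the "attention map" has already been reduced to a list of element bounding boxes in screen pixels. The names `Box` and `foveate` and the `padding` default are hypothetical and come from neither browser-use nor Stagehand. The sketch only computes the crop rectangle: the padded union of the regions of interest, clamped to the screen.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def foveate(regions: Sequence[Box], screen_w: int, screen_h: int,
            padding: int = 24) -> Box:
    """Return the padded union of `regions`, clamped to the screen.

    Falls back to the full screen when no regions are given, so the
    agent never receives an empty image.
    """
    if not regions:
        return Box(0, 0, screen_w, screen_h)
    left = max(0, min(r.left for r in regions) - padding)
    top = max(0, min(r.top for r in regions) - padding)
    right = min(screen_w, max(r.right for r in regions) + padding)
    bottom = min(screen_h, max(r.bottom for r in regions) + padding)
    return Box(left, top, right, bottom)


# Two stacked form fields on a 1920x1080 screen (illustrative coordinates).
crop = foveate([Box(800, 400, 1000, 440), Box(800, 460, 1000, 500)], 1920, 1080)
print(crop, crop.width, crop.height)
```

The resulting rectangle uses the (left, top, right, bottom) order, so with Pillow it could be applied as `image.crop((crop.left, crop.top, crop.right, crop.bottom))`. A union crop is the simplest choice. If the regions are far apart, sending one small crop per region may save more tokens, at the cost of losing their relative layout.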
Journey Context:
Full-screen screenshots consume 1000+ tokens per image in GPT-4o/Vision models. Early computer-use implementations sent entire 1920x1080 screenshots and hit context limits after 3-4 steps. The breakthrough came from browser-use and Stagehand implementations that extracted element bounding boxes and cropped to semantic regions (buttons, forms) rather than sending full pages. Trade-off: you lose peripheral context that might matter for spatial reasoning, but gain the ability to maintain histories of 10+ steps. The alternative, DOM-based extraction, loses the visual styling information that vision models use for state detection.
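The token arithmetic can be approximated with a short estimator. This is a sketch under stated assumptions, not an official formula. It follows the tile-based accounting OpenAI has documented for high-detail image inputs: fit within 2048x2048, shrink so the shortest side is at most 768 px, then charge 85 base tokens plus 170 per 512 px tile. It further assumes small images are not upscaled. The function name `estimate_vision_tokens` is hypothetical, and the constants should be checked against current provider documentation.

```python
import math


def estimate_vision_tokens(width: int, height: int) -> int:
    """Rough per-image token estimate (assumed tile-based accounting)."""
    # Step 1: shrink to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px
    # (assumption: images already smaller than that are left alone).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: 170 tokens per 512 px tile, plus a flat 85-token base cost.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


full = estimate_vision_tokens(1920, 1080)   # full-screen screenshot
crop = estimate_vision_tokens(400, 300)     # illustrative cropped region
print(full, crop, f"{1 - crop / full:.0%} fewer tokens")
```

Under these assumptions a 1920x1080 screenshot is estimated at 1105 tokens and a 400x300 crop at 255, a reduction of roughly 77%. The saving depends on crop size: any crop that still spans several 512 px tiles after scaling saves proportionally less.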
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:01:57.758608+00:00— report_created — created