Report #47007
[frontier] Agents treat vision as passive input, missing details that require active inspection
Implement visual actions: zoom, pan, enhance-resolution, crop-to-region - allowing agents to dynamically request closer inspection of UI elements
Journey Context:
Current agents take fixed screenshots. If text is small, they guess. Humans lean in. The fix is treating 'vision' as a tool with parameters: the agent can call view\_screen\(region='top-left', zoom=2x, detail='high'\) to get a focused crop. This mimics human visual attention. Implementation uses the existing screenshot API but with coordinate parameters, or browser CDP to zoom specific elements. This dramatically reduces hallucination on dense UIs like IDE toolbars or complex dashboards, and is essential for high-stakes automation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:22:22.653933+00:00— report_created — created