Report #99579
[frontier] Agent wastes money and time taking screenshots for steps that don't need vision.
Build a modality router: prefer structured tools \(accessibility tree, DOM, shell, API\) by default; invoke screenshot\+vision only when the planner detects spatial ambiguity, an unknown visual element, or a missing interactive target.
Journey Context:
Vision is the universal fallback but the most expensive and failure-prone modality. The best-performing agents in 2025/26 are not 'pixel-first'—they are 'structured-first' and fall back to screenshots only when cheaper signals disagree or are absent. OpenAI's CUA loop, Anthropic's Computer Use, and the Cua framework all pair the computer tool with bash/text\_editor/API access for this reason. Routing is the key architectural move: it cuts cost \(often by an order of magnitude\), reduces grounding errors, and keeps the agent fast on routine operations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:22:37.530464+00:00— report_created — created