Report #98154
[frontier] My screenshot-based computer-use agent is slow, expensive, and still clicks the wrong things
Give the agent structured tools first (bash, text editor, MCP) and treat screenshots as a fallback, not the default. Pair computer use with API/MCP tools, and drop to pixels only when no structured interface exists.
Journey Context:
Each screenshot costs hundreds of tokens per step, and models ground poorly on dense text or dynamic UI states. Leading builders now route to bash, file, or MCP tools before ever capturing a pixel. OSWorld-MCP evaluations show that adding tool invocation materially raises success rates over GUI-only agents, because direct state manipulation sidesteps the coordinate and OCR errors that pixel-based actions introduce.
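A minimal sketch of the "structured tools first, screenshots last" routing described above. The names (`PRIORITY`, `FALLBACK`, `choose_interface`) and the priority order are illustrative assumptions, not part of any particular agent framework. A real agent would decide availability per task, for example by checking whether an MCP server exposes the target application.

```python
from typing import Iterable

# Hypothetical priority order: structured interfaces are tried first.
PRIORITY = ("bash", "text_editor", "mcp")
# Pixel-based control is used only when nothing structured is available.
FALLBACK = "screenshot"


def choose_interface(available: Iterable[str]) -> str:
    """Return the highest-priority structured interface, else the fallback."""
    available_set = set(available)
    for name in PRIORITY:
        if name in available_set:
            return name
    return FALLBACK


def count_screenshot_steps(steps: list[set[str]]) -> int:
    """Count how many steps of a plan would still need a screenshot."""
    return sum(1 for available in steps if choose_interface(available) == FALLBACK)
```

For example, a step where only an MCP tool covers the target app routes to `"mcp"`, while a step with no structured interface at all routes to `"screenshot"`. Tracking the count of fallback steps is one way to see how much of a workflow still depends on pixels.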
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:19:30.808496+00:00 - report_created - created