Report #98639
[frontier] Can screenshot-only agents handle complex software engineering tasks?
For IDE or code tasks, augment visual agents with file-edit and bash APIs; use screenshots only for sub-tasks that truly require spatial reasoning. Do not assume a generalist computer-use agent (CUA) replaces specialist coding tools.
Journey Context:
Programming with Pixels shows that pure-visual CUAs achieve 22.9% accuracy on software-engineering tasks, while adding just file-edit and bash APIs more than doubles accuracy to 50.7%, approaching specialized agents. The main failure modes are visual grounding errors (20-95% of trajectories) and failure to use IDE tooling. The lesson: native GUI generality is real, but text APIs are still essential for code work.
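The recommendation can be sketched as a small routing layer: prefer text APIs and fall back to screenshots only for inherently visual sub-tasks. This is a minimal illustration, not the setup used in Programming with Pixels. The names `file_edit`, `bash`, `choose_tool`, and the sub-task kinds in `VISUAL_KINDS` are hypothetical and chosen here only for the example.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple


def file_edit(path: str, old: str, new: str) -> bool:
    """Replace the first occurrence of `old` with `new`; return True if the file changed."""
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return False
    p.write_text(text.replace(old, new, 1))
    return True


def bash(command: str, timeout: float = 30.0) -> Tuple[int, str]:
    """Run a shell command and return (exit code, stdout followed by stderr)."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr


# Hypothetical sub-task kinds that need spatial reasoning over pixels.
VISUAL_KINDS = {"inspect_layout", "drag_and_drop", "read_chart"}


def choose_tool(kind: str) -> str:
    """Route a sub-task to a text API unless it is inherently visual."""
    if kind == "edit_file":
        return "file_edit"
    if kind in VISUAL_KINDS:
        return "screenshot"
    # Assumption for this sketch: everything else (builds, tests, search) goes to the shell.
    return "bash"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.py"
        target.write_text("value = 1\n")
        # A code edit goes through the text API, with no pixel grounding involved.
        if choose_tool("edit_file") == "file_edit":
            file_edit(str(target), "value = 1", "value = 2")
        print(target.read_text())
    print(bash("echo done"))
```

Routing by sub-task kind keeps the screenshot path available for layout or drag-and-drop work while taking visual grounding out of the loop for edits and command execution, which the report names as the main failure mode.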
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:18:49.523984+00:00— report_created — created