Report #100509
[frontier] Computer-use agents cost too much because every screenshot step calls a frontier vision model
Insert a semantic router that probes a small VLM for confidence and routes easy actions to cheap models, escalating only hard, uncertain, or risky actions to the large VLM.
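A minimal sketch of the routing decision described above, under stated assumptions: the names `route_action`, `Decision`, and `RISKY_KEYWORDS`, the 0.8 confidence threshold, and the keyword-based risk check are all hypothetical placeholders, not part of AVR or of any library. A real router would get `small_confidence` from the small VLM's probe and would likely use a stronger risk classifier.

```python
from dataclasses import dataclass

# Hypothetical keyword list standing in for a real risk classifier.
RISKY_KEYWORDS = ("delete", "purchase", "send", "pay")


@dataclass(frozen=True)
class Decision:
    model: str   # "small" or "large"
    reason: str  # why this model was chosen


def route_action(instruction: str, small_confidence: float,
                 threshold: float = 0.8) -> Decision:
    """Pick a model for one action.

    instruction: the natural-language action for this step.
    small_confidence: confidence in [0, 1] reported by the small-VLM probe.
    threshold: assumed cutoff below which the step is escalated.
    """
    text = instruction.lower()
    # Safety override: risky actions escalate regardless of confidence.
    if any(keyword in text for keyword in RISKY_KEYWORDS):
        return Decision("large", "risky")
    # Uncertain actions escalate to the large VLM.
    if small_confidence < threshold:
        return Decision("large", "low_confidence")
    # Everything else stays on the cheap model.
    return Decision("small", "confident")


if __name__ == "__main__":
    print(route_action("Click the Settings tab", 0.93))
    print(route_action("Delete the selected account", 0.99))
    print(route_action("Click the small icon next to the label", 0.41))
```

The risk check runs before the confidence check so that a confidently wrong small model cannot execute a dangerous action on its own.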
Journey Context:
Current CUAs call one frontier VLM on every step, but action difficulty varies more than model accuracy does. AVR (2026) shows that a 7B VLM handles ~70% of grounding steps, a 72B model handles the rest, and a safety override catches dangerous actions. Memory of prior UI interactions helps small models disproportionately, pushing warm-agent savings to 78% while staying within 2 percentage points of all-large accuracy. The common mistake is assuming model size must match the complexity of the whole task; the right call is to allocate a model per action.
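To make the savings arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a simple cost model that is not taken from the report: every step pays for the small-model probe, and escalated steps additionally pay for the large model. The function names and the 5% small-to-large cost ratio are illustrative assumptions; only the ~30% escalation rate (the steps the 7B model does not handle) comes from the text above.

```python
def relative_cost(escalation_rate: float, small_cost_ratio: float) -> float:
    """Cost of the routed agent as a fraction of the all-large baseline.

    escalation_rate: fraction of steps sent to the large model.
    small_cost_ratio: assumed cost of one small-model call divided by
        the cost of one large-model call.
    """
    # Every step runs the small probe; escalated steps also run the large model.
    return small_cost_ratio + escalation_rate


def savings(escalation_rate: float, small_cost_ratio: float) -> float:
    """Fractional savings versus calling the large model on every step."""
    return 1.0 - relative_cost(escalation_rate, small_cost_ratio)


if __name__ == "__main__":
    # ~70% of steps stay on the small model, so ~30% escalate.
    # The 0.05 cost ratio is an assumption for illustration only.
    print(f"estimated savings: {savings(0.30, 0.05):.0%}")
```

Under this simplified model, reaching 78% savings would require the escalation rate plus the small-model cost ratio to total about 0.22, which is consistent with memory letting the small model keep more steps than the ~70% cold figure.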
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:20:36.581588+00:00— report_created — created