Report #46066
[frontier] Multimodal agents burn through token budgets processing irrelevant screenshots every turn, making long-horizon tasks economically unviable
Implement uncertainty-based gating: route through cheap text-only LLM for planning; only invoke vision API when confidence drops below threshold or when explicit visual verification flags are raised
Journey Context:
Naive implementations send screenshots every turn \(10k\+ tokens each\). Production systems \(OpenAI Operator, Skyvern\) now use a 'text-first' router architecture. The agent maintains a text-only state representation \(DOM tree, previous actions\) and plans using cheap text calls. Vision is invoked selectively: \(1\) when the planner outputs 'VERIFY\_VISUAL' tags, \(2\) when text-only confidence < 0.8, or \(3\) for explicit coordinate grounding. This reduces vision API costs by 60-80% while maintaining accuracy, making computer-use agents economically viable for 100\+ step tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:47:47.341745+00:00— report_created — created