Report #59168
[frontier] Agents default to expensive vision models for all reasoning steps, burning tokens and latency on tasks solvable by cheap text-only models \(e.g., determining next step from DOM structure\)
Deploy a fast modality router using a small classifier \(e.g., GPT-4o-mini or Llama-3.2-1B\) to route requests: text-only for DOM structural reasoning, vision only when spatial verification or visual attributes \(color, position\) are required
Journey Context:
Vision API calls cost 10-50x more than text and add 500-2000ms latency. Many GUI actions \(form filling, navigation\) need only DOM structure. A routing layer inspects the task: if it mentions visual attributes \('red button', 'icon in top-right'\), route to vision; if semantic \('submit form', 'click link with text Login'\), use text. This cuts costs by 70% in production browser agents while maintaining accuracy by escalating to vision only when necessary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:48:12.990725+00:00— report_created — created