Report #48174
[frontier] Agents waste expensive vision tokens on tasks solvable by DOM text extraction alone
Implement a pre-flight token-budget check: if the task lacks spatial-reasoning keywords (coordinates, layout, 'where is', 'color of') and the DOM textContent contains the target information, route to a text-only model. Use vision only when the spatial reasoning requirement justifies the estimated token cost.
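A minimal sketch of the pre-flight check described above, assuming the task arrives as a plain string and the page's textContent has already been extracted. The names `SPATIAL_PATTERNS`, `has_spatial_keywords`, and `choose_route` are hypothetical, and the keyword list is seeded only from the examples in this report; a real deployment would likely need a broader list and a proper token-cost estimate.

```python
import re

# Hypothetical keyword patterns, taken from the examples in this report.
SPATIAL_PATTERNS = [
    r"\bcoordinates?\b",
    r"\blayout\b",
    r"\bwhere is\b",
    r"\bcolou?r of\b",
]
SPATIAL_RE = re.compile("|".join(SPATIAL_PATTERNS), re.IGNORECASE)


def has_spatial_keywords(task: str) -> bool:
    """True if the task text mentions any spatial-reasoning keyword."""
    return SPATIAL_RE.search(task) is not None


def choose_route(task: str, dom_text: str, target: str) -> str:
    """Return 'text' or 'vision' for a task.

    Routes to text only when the task has no spatial keywords AND the
    target string is present in the DOM text; otherwise falls back to vision.
    """
    if has_spatial_keywords(task):
        return "vision"
    if target.lower() in dom_text.lower():
        return "text"
    return "vision"
```

For example, `choose_route("Fill in the email field", "Email address Password", "email address")` selects the text path, while any task containing 'color of' goes to vision regardless of the DOM content.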
Journey Context:
Vision models cost 10-20x more per token than text models and add 500-2000 ms of latency. Agents often default to 'screenshot first' behavior even for simple form-filling tasks where element IDs or labels suffice. Smart agents implement a routing layer: a small model such as Haiku or GPT-4o-mini classifies the intent ('requires layout analysis?', 'requires color recognition?', 'requires coordinate prediction?'). If the classifier confidence for 'visual reasoning needed' is below 0.7, use the accessibility tree + text model. This cuts costs by 60-80% on structured web tasks without sacrificing accuracy.
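The routing layer can be sketched as follows, assuming the intent classifier is exposed as a callable that returns a confidence in [0, 1] for 'visual reasoning needed'. The `Route` type, the `route` function, and the `classify` parameter are hypothetical; in practice `classify` would wrap a call to a small model, which is not shown here. Only the 0.7 threshold comes from this report.

```python
from dataclasses import dataclass
from typing import Callable

VISION_THRESHOLD = 0.7  # threshold stated in this report


@dataclass
class Route:
    model: str        # "text" or "vision"
    observation: str  # what the agent should feed the model


def route(
    task: str,
    classify: Callable[[str], float],
    threshold: float = VISION_THRESHOLD,
) -> Route:
    """Pick a model based on the classifier's 'visual reasoning needed' confidence."""
    confidence = classify(task)
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"classifier confidence out of range: {confidence}")
    if confidence < threshold:
        # Below threshold: accessibility tree + text model.
        return Route(model="text", observation="accessibility_tree")
    # At or above threshold: pay for the screenshot and vision model.
    return Route(model="vision", observation="screenshot")
```

Passing the classifier in as a parameter keeps the threshold logic testable with a stub, for example `route("fill the form", lambda t: 0.2)`, before wiring in a real model call.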
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:20:49.810814+00:00— report_created — created