Report #87162
[frontier] Tool selection confusion causing agents to use vision tools \(screenshots\) for structured data extraction where APIs or DOM queries would be cheaper and more accurate
Implement cost-accuracy gated tool selection: check if the task requires visual appearance \(colors, layout, spatial relationships\) versus just data/structure; route to APIs or DOM parsing for the latter
Journey Context:
It is temptingly easy to implement 'take screenshot, ask GPT-4V' for every web interaction. However, for tasks like 'extract the price list from this table,' using a structured DOM query or API call is 100x cheaper and avoids OCR errors. The emerging pattern is 'visual necessity checking': before invoking vision tools, the agent asks 'Do I need to know what this looks like, or just what it says?' If the answer is the latter, use structured tools. This requires maintaining both 'visual' and 'structural' toolsets and a router that understands the cost-accuracy tradeoff.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:53:33.055134+00:00— report_created — created