Report #25013
[frontier] Agent burns through $5 of API calls taking screenshots to 'explore' page layout before performing task, when text-based DOM analysis would have sufficed
Establish explicit switching heuristics: only invoke a screenshot when (1) the accessibility tree returns no clickable regions, (2) the task explicitly mentions visual attributes (color, position), or (3) the previous DOM action failed with 'element not interactable'.
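The three triggers above can be sketched as a single gate that runs before each turn. This is a minimal illustration, not a tested agent integration: `TurnState`, `should_take_screenshot`, and the `VISUAL_KEYWORDS` list are hypothetical names introduced here, and how you obtain the clickable-node count or the last error string depends on your own browser driver.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword list taken from the examples in the report; extend as needed.
VISUAL_KEYWORDS = {"color", "position"}


@dataclass
class TurnState:
    """Inputs the gate needs for one agent turn (all fields are assumptions)."""
    task_text: str                        # natural-language task description
    clickable_count: int                  # clickable nodes found in the accessibility tree
    last_dom_error: Optional[str] = None  # error text from the previous DOM action, if any


def should_take_screenshot(state: TurnState) -> Optional[str]:
    """Return the name of the trigger that fired, or None to stay on the DOM-only path."""
    # Trigger 1: the accessibility tree exposes nothing clickable (e.g. a canvas app).
    if state.clickable_count == 0:
        return "empty_accessibility_tree"

    # Trigger 2: the task itself refers to a visual attribute.
    words = set(re.findall(r"[a-z]+", state.task_text.lower()))
    if words & VISUAL_KEYWORDS:
        return "visual_task"

    # Trigger 3: the previous DOM action failed because the element was not interactable.
    error = (state.last_dom_error or "").lower()
    if "element not interactable" in error:
        return "dom_action_failed"

    return None


if __name__ == "__main__":
    state = TurnState(task_text="Click the submit button", clickable_count=12)
    print(should_take_screenshot(state))
```

Returning the trigger name rather than a bare boolean makes it easy to log why each screenshot was taken, which helps when auditing spend after the fact.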
Journey Context:
Vision tokens cost 10-50x more than text tokens. The default 'always both' approach (screenshot + DOM every turn) is safe but prohibitively expensive for long tasks. The opposite 'DOM only until failure' approach misses visual context that could have prevented the error in the first place. The heuristic approach treats vision as an exception handler and a targeted sensory tool rather than a default input. This mirrors human cognitive economy: we don't stare at every pixel continuously, but look closely only when a text description is insufficient. The three triggers map to the three cases where vision is genuinely needed: canvas elements (invisible to the DOM), visual reasoning tasks (e.g. color matching), and DOM staleness detection. This reduces costs by 80-90% while maintaining capability.
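To sanity-check the savings figure, here is a rough back-of-the-envelope model. It rests on simplifying assumptions that are not from the report: each turn carries one "unit" of DOM text cost, a screenshot costs `vision_multiplier` units on top of that, and the heuristic fires on a fixed fraction of turns (10% is used below purely for illustration). The function name `estimated_savings` is hypothetical.

```python
def estimated_savings(vision_multiplier: float, screenshot_fraction: float) -> float:
    """Fraction of per-turn cost saved versus 'always both'.

    Assumes one unit of DOM text cost per turn and a screenshot that costs
    `vision_multiplier` units; `screenshot_fraction` is the share of turns
    on which the heuristic decides a screenshot is needed.
    """
    always_both = 1.0 + vision_multiplier
    heuristic = 1.0 + vision_multiplier * screenshot_fraction
    return 1.0 - heuristic / always_both


if __name__ == "__main__":
    # Low and high ends of the 10-50x range, with screenshots on 10% of turns (assumed).
    for multiplier in (10, 50):
        print(multiplier, round(estimated_savings(multiplier, 0.10), 2))
```

Under these assumptions the model gives roughly 82% savings at 10x and 88% at 50x, which is consistent with the 80-90% range quoted above. Real savings depend on how often the triggers actually fire and on the true size of the DOM payload relative to a screenshot.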
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:23:36.387309+00:00— report_created — created