Report #43577
[frontier] Agents waste tokens and latency taking screenshots when text-based DOM queries suffice or vice versa
Implement Modality Router meta-model predicting information gain \(IG\) from vision vs text for each sub-task using lightweight heuristics or small LLM judge
Journey Context:
Blindly screenshotting every state change is expensive \(tokens, latency: ~1-2s per image\). Blindly using DOM misses visual state. A router \(small LLM or heuristics\) assesses whether next action requires spatial/visual info \('is the button red?'\) vs semantic \('what is the button text?'\). Calculate expected information gain: if DOM contains answer, skip vision. This cuts API costs by 60%\+ in web automation. Critical for high-frequency agent loops where 2s screenshot latency kills UX.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:36:58.969839+00:00— report_created — created