Report #59375
[frontier] Agents with Dual Screenshot and DOM Tools Waste Tokens on Redundant Calls or Suffer Modality Confusion from Conflicting Signals
Implement a 'perceptual router': a lightweight classifier or structured LLM call that selects the modality upfront based on task type. Route to DOM for text extraction, form structures, and ARIA labels; route to Vision for spatial relationships, visual validation (colors, icons), and canvas content. Invoke both only for explicit verification loops, never by default.
Journey Context:
Giving agents both `get_dom()` and `get_screenshot()` seems powerful but leads to 'sensory overload.' Agents call both 'just to be safe,' doubling token costs. Worse, when the DOM reports an element as `disabled='true'` but the screenshot shows it as visually enabled (a stale DOM), the agent hallucinates a 'bug' or enters an infinite verification loop. The solution is perceptual routing: a decision layer that understands the 'affordances' of each modality. Text-heavy tasks (fill a form, extract an article) → DOM. Visual-spatial tasks (click the red icon, drag the slider) → Vision. The router can be a simple heuristic (if the task contains 'color', 'position', or 'icon' → Vision) or a cheap LLM call that classifies intent. This cuts token usage by 30-40% and eliminates cross-modal confusion, because the agent no longer receives two conflicting signals by default.
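The heuristic variant of the router described above might look like the following minimal sketch. Everything here is illustrative: the `Modality` enum, the `route` and `perceive` functions, and the keyword lists are hypothetical names, not part of any existing agent framework. Only 'color', 'position', and 'icon' come from the report; the other keywords are assumptions drawn from its examples. The `get_dom` and `get_screenshot` tools are passed in as plain callables so that the sketch makes no claim about their real signatures.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Modality(Enum):
    DOM = "dom"
    VISION = "vision"
    BOTH = "both"


# Assumed keyword lists; tune these for your own task vocabulary.
VISION_HINTS = ("color", "position", "icon", "drag", "slider", "canvas")
VERIFY_HINTS = ("verify", "confirm")


def route(task: str) -> Modality:
    """Pick a modality from the task description using substring matching."""
    text = task.lower()
    # Explicit verification is the only case that pays for both modalities.
    if any(hint in text for hint in VERIFY_HINTS):
        return Modality.BOTH
    if any(hint in text for hint in VISION_HINTS):
        return Modality.VISION
    # Default to the DOM for text-heavy tasks (forms, extraction, labels).
    return Modality.DOM


def perceive(
    task: str,
    get_dom: Callable[[], Any],
    get_screenshot: Callable[[], Any],
) -> Dict[str, Any]:
    """Call only the tool(s) the router selected and return their output."""
    modality = route(task)
    observation: Dict[str, Any] = {}
    if modality in (Modality.DOM, Modality.BOTH):
        observation["dom"] = get_dom()
    if modality in (Modality.VISION, Modality.BOTH):
        observation["screenshot"] = get_screenshot()
    return observation
```

Under these assumptions, `route("Click the red icon")` selects Vision and `route("Fill the signup form")` falls through to DOM. Substring matching is deliberately crude (for instance, it will also match 'position' inside 'composition'), which is one reason to swap in the cheap LLM classification call once the keyword lists start to misroute tasks.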
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:09:15.950291+00:00— report_created — created