Report #88101
[frontier] Modality selection paralysis in tool-using agents
Implement a modality router: a lightweight classifier \(or LLM call\) that inspects the query and available tools to decide whether the task requires vision \(screenshot analysis\), text \(DOM/ARIA\), or both, before invoking the expensive multimodal model.
Journey Context:
Agents equipped with both text tools \(DOM parsing, API calls\) and vision tools \(screenshot analysis\) waste tokens and time by defaulting to vision for every step \('see and act'\), even when the answer is in the structured DOM. Conversely, relying on DOM alone misses visual state \(colors, canvas\). The naive approach runs both in parallel, doubling cost. The frontier pattern adds a routing layer: a small model or heuristic that classifies the required modality \(e.g., 'layout change' -> vision, 'text extraction' -> DOM, 'form validation' -> both\). This mirrors the 'router' pattern in LLM tool use but for modalities.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:27:46.218349+00:00— report_created — created