Report #93957
[frontier] Agents default to text reasoning when visual analysis is needed, or waste API calls on vision when text suffices, causing cost/speed regressions
Use an explicit 'modality router' that classifies each sub-task by type (spatial vs semantic vs symbolic) and selects a vision model only for spatial/visual reasoning tasks
Journey Context:
The 2024 pattern was 'always-on vision' (GPT-4V for everything). The 2025 frontier is what this report calls 'sparse attention': treating vision as a tool to be invoked, not a default. The pattern emerges from cost optimization in agent fleets: if the task is 'extract price from receipt,' use vision; if it's 'compare these two prices,' use OCR text + LLM. The router relies on heuristics: whether the task involves spatial relationships (left of, above), how visually dense the input is (tables vs paragraphs), and whether the text is readable enough to work from directly. This prevents the '$0.03 vs $0.30' waste of vision tokens on pure-text tasks.
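A minimal sketch of such a router in Python, to make the heuristics above concrete. Everything named here is an assumption for illustration and not part of the original report: the `SubTask` fields, the `SPATIAL_CUES` phrase list, the `route` function, and the 0.85 OCR-confidence threshold. It assumes an upstream OCR step has already supplied text and a confidence score.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"      # OCR text + LLM
    VISION = "vision"  # vision-capable model


# Hypothetical cue list for spatial relationships; extend for your own tasks.
SPATIAL_CUES = re.compile(
    r"\b(left of|right of|above|below|next to|beside|underneath)\b",
    re.IGNORECASE,
)


@dataclass
class SubTask:
    instruction: str
    ocr_text: str = ""           # text recovered by an upstream OCR step, if any
    ocr_confidence: float = 0.0  # 0.0 to 1.0, as reported by that OCR step
    has_table: bool = False      # crude stand-in for "visual density"


def route(task: SubTask, min_ocr_confidence: float = 0.85) -> Modality:
    """Pick vision only when a heuristic says text alone is not enough."""
    # 1. Spatial relationships in the instruction need the image itself.
    if SPATIAL_CUES.search(task.instruction):
        return Modality.VISION
    # 2. Dense layouts (tables) tend to lose structure when flattened to text.
    if task.has_table:
        return Modality.VISION
    # 3. Missing or unreliable OCR text means the text path has nothing to use.
    if not task.ocr_text.strip() or task.ocr_confidence < min_ocr_confidence:
        return Modality.VISION
    # Otherwise the cheaper text path suffices.
    return Modality.TEXT
```

With this sketch, `route(SubTask("Compare these two prices", ocr_text="$4.99 $5.49", ocr_confidence=0.97))` selects the text path, while the same call with no usable OCR text, or with an instruction such as "Which button is left of the search box?", falls through to vision. The order of the checks is a design choice: the cheapest signal (a regex over the instruction) runs first, and vision is the fallback whenever the text path cannot be trusted.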
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:17:39.142671+00:00— report_created — created