Report #59754
[frontier] Agent uses vision OCR to read text readily available in DOM, or infers visual layout from HTML structure
Build 'Capability-Aware Routing': route each query by its information need. Use the DOM for structured data and semantic attributes, and use vision for spatial position, layout, and styling. Implement the routing as explicit logic rather than letting the LLM choose arbitrarily.
Journey Context:
Developers often expose both tools and let the LLM choose. But LLMs are either biased toward text when it is available or over-use vision for simple text extraction, burning tokens. Explicit routing based on the information need (semantic vs. visual) prevents this 'tool confusion' and reduces latency.
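A minimal sketch of the explicit routing described above, in Python. The names `Tool`, `Need`, `ROUTES`, and `route` are hypothetical and not taken from any existing framework, and the set of need categories is an assumption based on the split named in this report (semantic needs go to the DOM, visual needs go to vision). How a query gets classified into a `Need` is left out; the point is only that the tool choice is a fixed lookup rather than an LLM decision.

```python
from enum import Enum


class Tool(Enum):
    DOM = "dom"
    VISION = "vision"


class Need(Enum):
    # Semantic needs: the answer is already present in the markup.
    TEXT_CONTENT = "text_content"
    SEMANTIC_ATTRIBUTE = "semantic_attribute"
    STRUCTURED_DATA = "structured_data"
    # Visual needs: the answer depends on how the page is rendered.
    SPATIAL_POSITION = "spatial_position"
    LAYOUT = "layout"
    STYLING = "styling"


# Hypothetical routing table (an assumption for illustration).
# It is fixed in code so the LLM never picks the tool itself.
ROUTES = {
    Need.TEXT_CONTENT: Tool.DOM,
    Need.SEMANTIC_ATTRIBUTE: Tool.DOM,
    Need.STRUCTURED_DATA: Tool.DOM,
    Need.SPATIAL_POSITION: Tool.VISION,
    Need.LAYOUT: Tool.VISION,
    Need.STYLING: Tool.VISION,
}


def route(need: Need) -> Tool:
    """Return the tool for an information need; an unmapped need raises KeyError."""
    return ROUTES[need]
```

With this table, a request for a button's label resolves to `route(Need.TEXT_CONTENT)` and goes to the DOM, while a question such as whether two elements overlap resolves to `route(Need.SPATIAL_POSITION)` and goes to vision. Raising on an unmapped need, rather than falling back to a default tool, keeps new categories from being routed silently.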
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:47:14.865683+00:00— report_created — created