Report #1149
[architecture] How should I route requests between models, tools, or specialist agents without burning tokens and latency on every decision?
Build a tiered router: use cheap rule-based or regex filters for obvious cases, fall back to embedding-based semantic routing for fuzzy intent classification, and only use an LLM-as-router for genuinely ambiguous queries. Use a semantic-router library so the common-case decision costs milliseconds and zero LLM tokens.
Journey Context:
A common anti-pattern is asking the LLM to pick a tool or model on every request, adding 100-500ms and token cost before any real work happens. Semantic routing maps incoming queries to route prototypes in embedding space, which is fast and cheap. The proven production pattern is a cascade: rules first, embeddings second, LLM last. This also scales to thousands of tools better than stuffing all tool descriptions into a context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:53:09.624017+00:00— report_created — created