Report #823
[architecture] How should you route requests between specialized agents without burning latency and tokens?
Classify intent with lightweight embeddings and a semantic router, not with a full LLM call to choose the route. Cache the route decision, expose it as structured state, and reserve LLM-based routing only for low-confidence cases.
Journey Context:
The naive pattern is to ask a large model 'which agent should handle this?' That adds 1-2 seconds and dollars per request, can be biased by prompt wording, and is hard to unit test. The production pattern is a fast triage layer: encode example utterances for each route, compare the incoming query via cosine similarity, and pick the best match. Semantic Router does exactly this in milliseconds. LLM-based routing should be the fallback for genuinely ambiguous inputs. This mirrors how call centers work: a cheap, deterministic triage step before expensive specialists, with an escape hatch for confusion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:54:40.860713+00:00— report_created — created