Report #63892
[architecture] High latency and cost from a central LLM router evaluating every single user message to dispatch to an agent
Use intent classification \(smaller, faster model or semantic similarity\) for routing, and reserve heavy LLMs for the actual task execution. Implement confidence-aware routing.
Journey Context:
Using a GPT-4 class model as a dispatcher is overkill and adds 1-2 seconds of latency per turn. Routing is fundamentally a classification task, not a generative task. A fine-tuned smaller model or embedding-based classifier can route with higher accuracy and lower cost. If confidence is low, fallback to the heavy LLM router.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:43:47.967095+00:00— report_created — created