Report #84047
[synthesis] How to implement multi-model routing without adding latency
Use a fine-tuned small model or heuristic classifier for query routing, rather than prompting a large LLM to classify the query. The routing decision must happen in milliseconds to avoid degrading the user experience.
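A minimal sketch of the heuristic-classifier option, assuming three categories and placeholder model names (`small-model`, `large-model`, `code-model` are hypothetical, as are the keyword lists). It uses only precompiled regular expressions, so the routing decision involves no network call or model inference:

```python
import re

# Hypothetical model tiers; these names are placeholders, not real endpoints.
ROUTES = {
    "code generation": "code-model",
    "complex reasoning": "large-model",
    "simple fact": "small-model",
}

# Illustrative keyword lists (assumptions); tune them on real traffic.
CODE_PATTERN = re.compile(
    r"\b(function|class|regex|sql|python|javascript|stack trace|compile|refactor|debug)\b",
    re.IGNORECASE,
)
REASONING_PATTERN = re.compile(
    r"\b(why|compare|trade-?offs?|explain|analy[sz]e|step by step|prove|design)\b",
    re.IGNORECASE,
)

def classify(query: str) -> str:
    """Return a coarse category using cheap pattern checks only."""
    if CODE_PATTERN.search(query):
        return "code generation"
    # Long queries are treated as complex; the 40-word cutoff is arbitrary.
    if REASONING_PATTERN.search(query) or len(query.split()) > 40:
        return "complex reasoning"
    return "simple fact"

def route(query: str) -> str:
    """Map a query to the (hypothetical) model that should handle it."""
    return ROUTES[classify(query)]
```

Calling `route("Write a Python function that reverses a list")` selects the code tier under these rules. Wrapping the call in `time.perf_counter()` is a simple way to check that routing stays within the latency budget on your own hardware.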
Journey Context:
A common mistake is using a powerful LLM to decide which LLM to use, which adds hundreds of milliseconds of latency before the actual work begins. Architectural signals from Notion AI and Perplexity suggest that routing is often handled by specialized, lightweight classifiers or embedding-based similarity searches that categorize the query (e.g., 'simple fact', 'complex reasoning', 'code generation') within milliseconds. This keeps the Time-To-First-Token (TTFT) competitive.
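The embedding-based similarity search described above can be sketched as nearest-centroid matching. This is an illustration under stated assumptions, not how any named product does it: `embed` here is a toy bag-of-words stand-in for a real embedding model, and the example queries per category are invented.

```python
import math
import re
from collections import Counter

# Invented example queries per category (assumption, for illustration only).
EXAMPLES = {
    "simple fact": [
        "what is the capital of france",
        "who wrote hamlet",
        "when did the war end",
    ],
    "complex reasoning": [
        "explain why interest rates affect inflation",
        "compare the pros and cons of two strategies",
    ],
    "code generation": [
        "write a python function to sort a list",
        "fix this bug in my javascript code",
    ],
}

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Precompute one centroid per category at startup, so each request only
# pays for one embed() call and a handful of similarity computations.
CENTROIDS = {
    label: sum((embed(e) for e in examples), Counter())
    for label, examples in EXAMPLES.items()
}

def route_by_similarity(query: str) -> str:
    """Return the category whose centroid is closest to the query."""
    q = embed(query)
    return max(CENTROIDS, key=lambda label: cosine(q, CENTROIDS[label]))
```

With a real embedding model, `embed` would return a dense vector and the centroids would be averaged vectors; the precompute-then-compare structure stays the same. Note that a query sharing no words with any example falls back to the first category in this toy version.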
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:39:54.869056+00:00— report_created — created