Report #93481
[synthesis] How to handle multiple LLM models in production AI products — single model vs. intelligent routing
Implement a model routing layer as a separate architectural component that sits between your application logic and LLM providers. Route based on: (1) task type (classification → small model, reasoning → large model), (2) latency requirement (autocomplete → fast model, chat → any model), (3) cost (high-volume paths → cheap model, low-volume critical paths → expensive model), (4) fallback chain (primary model fails → secondary model). Use structured metadata about the request to make routing decisions, not the request content itself.
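A minimal sketch of metadata-based routing, assuming three hypothetical model tiers and a hypothetical RequestMeta shape. The model names, the 300 ms threshold, the default tier, and the priority order (latency, then cost, then task type) are illustrative assumptions, not taken from any specific product:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model identifiers; substitute whatever your providers expose.
SMALL_MODEL = "small-model"
FAST_MODEL = "fast-model"
LARGE_MODEL = "large-model"


@dataclass(frozen=True)
class RequestMeta:
    """Structured metadata about a request (never the prompt text itself)."""
    task_type: str                          # e.g. "classification", "reasoning", "chat"
    latency_budget_ms: Optional[int] = None  # None means no hard latency requirement
    high_volume: bool = False                # True for cost-sensitive, high-traffic paths


def route(meta: RequestMeta) -> str:
    """Pick a model from metadata alone.

    Assumed priority: latency, then cost, then task type.
    """
    # (2) Latency: tight budgets (assumed threshold: 300 ms) go to the fast model.
    if meta.latency_budget_ms is not None and meta.latency_budget_ms < 300:
        return FAST_MODEL
    # (3) Cost: high-volume paths go to the cheap model.
    if meta.high_volume:
        return SMALL_MODEL
    # (1) Task type: classification is cheap, reasoning needs the large model.
    if meta.task_type == "classification":
        return SMALL_MODEL
    if meta.task_type == "reasoning":
        return LARGE_MODEL
    # Anything else (e.g. chat) can use any model; the default here is an assumption.
    return LARGE_MODEL
```

Because route() only sees RequestMeta, the application code decides what kind of request it is making and the router decides which model serves it; the rules can be reordered or replaced without touching call sites.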
Journey Context:
The naive approach is to pick 'the best model' and use it everywhere. The synthesis across Cursor (user-selectable models + auto-routing for Tab), Perplexity (model selection per query type), and the rise of routing frameworks (LiteLLM, OpenRouter) suggests that production AI products tend to converge on multi-model architectures with a routing layer. This is not just cost optimization: different models have different strengths (code vs. reasoning vs. speed), and no single model is optimal for all tasks. The routing layer also provides resilience: when a provider has an outage, the router falls back to a secondary model. The key architectural insight is that the routing layer must be a separate component, not embedded in business logic, because routing rules change frequently as new models are released and pricing shifts.
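The resilience and separation points can be sketched as follows, assuming routing rules live in a plain data table and the provider call is injected as a function. FALLBACK_CHAINS, call_with_fallback, AllModelsFailed, and the model names are hypothetical, and call_model stands in for whatever client your application actually uses:

```python
from typing import Callable, Dict, List, Sequence


class AllModelsFailed(Exception):
    """Raised when every model in a fallback chain has failed."""


# Routing rules kept as data, outside business logic, so they can change
# when models or prices change. Model names are hypothetical.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "reasoning": ["large-model", "backup-large-model"],
    "classification": ["small-model", "large-model"],
}


def call_with_fallback(
    chain: Sequence[str],
    prompt: str,
    call_model: Callable[[str, str], str],
) -> str:
    """Try each model in order and return the first successful response.

    call_model(model, prompt) is an injected provider call (an assumption
    here); any exception it raises is treated as a failure of that model.
    """
    errors: List[str] = []
    for model in chain:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

A real router would likely distinguish retryable errors (timeouts, rate limits) from non-retryable ones (invalid requests) rather than catching every exception; that refinement is omitted to keep the sketch short.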
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:29:40.147567+00:00— report_created — created