Report #71088
[synthesis] AI product uses a single large model for all tasks, causing high latency on simple requests and high cost on complex ones
Implement task-based model routing: fast/small model for autocomplete and classification, medium model for single-file edits and retrieval, large model only for multi-step reasoning and agent loops. Make routing rule-based first, then consider learned routing.
Journey Context:
Most successful AI coding products use multi-model routing, but it is almost never documented as an architectural principle. Cursor exposes this in its UI: tab-complete uses a custom fast model, chat defaults to a mid-tier model, and agent mode uses the most capable model. Perplexity routes query classification to a small model and synthesis to a large one. The economics are brutal if you don't route: a 175B-parameter model can cost roughly 100x more per token than a 7B model, and for autocomplete (where the user expects <200ms latency), a large model is too slow regardless of cost. The common mistake is starting with one model and trying to optimize it for everything: you end up with a model that is too slow for autocomplete and too expensive for chat. The right approach is to design the routing topology first: define latency and cost budgets per feature, then select models that fit. Rule-based routing (feature → model) works initially; learned routing (query classifier → model) is an optimization for later. The hidden cost: each model in the routing table is a separate integration to maintain, monitor, and evaluate.
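A minimal sketch of the budget-first, rule-based approach described above, in Python. Everything here is hypothetical and for illustration only: the model names (`small-fast`, `medium`, `large`), the latency and cost figures, the capability ranks, and the feature keys are assumptions, not measurements or any vendor's real catalog. The idea is to declare a latency/cost budget per feature, pick the cheapest model that fits each budget once at startup, and then route by a plain feature lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    p95_latency_ms: int        # assumed latency; measure this for real models
    cost_per_1k_tokens: float  # assumed relative cost, not real pricing
    capability: int            # coarse rank: higher means more capable


@dataclass(frozen=True)
class Budget:
    max_latency_ms: int
    max_cost_per_1k_tokens: float
    min_capability: int


# Hypothetical catalog. The 100x cost spread mirrors the rough ratio in the text.
MODELS = [
    ModelSpec("small-fast", p95_latency_ms=120, cost_per_1k_tokens=0.01, capability=1),
    ModelSpec("medium", p95_latency_ms=900, cost_per_1k_tokens=0.10, capability=2),
    ModelSpec("large", p95_latency_ms=4000, cost_per_1k_tokens=1.00, capability=3),
]

# Per-feature budgets, defined before any model is chosen.
BUDGETS = {
    "autocomplete": Budget(200, 0.02, 1),
    "classification": Budget(300, 0.02, 1),
    "single_file_edit": Budget(2000, 0.20, 2),
    "retrieval": Budget(2000, 0.20, 2),
    "multi_step_reasoning": Budget(10000, 2.00, 3),
    "agent_loop": Budget(10000, 2.00, 3),
}


def select_model(budget: Budget, models=MODELS) -> ModelSpec:
    """Return the cheapest model that satisfies the budget, or raise."""
    candidates = [
        m for m in models
        if m.p95_latency_ms <= budget.max_latency_ms
        and m.cost_per_1k_tokens <= budget.max_cost_per_1k_tokens
        and m.capability >= budget.min_capability
    ]
    if not candidates:
        raise ValueError(f"no model fits budget {budget}")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


def build_routing_table(budgets=BUDGETS, models=MODELS) -> dict:
    """Resolve every feature to a model name once, up front."""
    return {feature: select_model(b, models).name for feature, b in budgets.items()}


ROUTES = build_routing_table()


def route(feature: str) -> str:
    """Rule-based routing: feature -> model name."""
    if feature not in ROUTES:
        raise KeyError(f"no route defined for feature {feature!r}")
    return ROUTES[feature]
```

Because the table is resolved from budgets rather than hard-coded, a feature whose budget no model can meet fails at startup instead of in production. A learned router could later replace `route` without changing the budgets, but each entry in `MODELS` still carries the maintenance and evaluation cost noted above.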
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:54:12.077478+00:00— report_created — created