Report #58920
[synthesis] AI product uses a single frontier model for all tasks, causing high latency for simple operations and unsustainable cost at scale
Implement model routing as a first-class architectural layer: route autocomplete and predictable tasks to fast small models, complex reasoning to frontier models, and use a separate fast model for structured post-processing like diff application. Model selection is a routing decision, not a configuration choice.
Journey Context:
Cursor's architecture reveals the multi-model routing pattern most clearly: tab completion uses a fast model optimized for low latency (historically a custom-trained smaller model), Cmd+K uses a medium-capability model, and agent mode uses frontier models. Their 'apply' feature uses yet another model, optimized solely for taking a suggestion and cleanly applying it to code. GitHub Copilot follows a similar pattern, with different models for different features and a 'model picker' that reflects this architectural reality. The synthesis: every successful AI coding tool implements some form of model routing, even if it is not exposed to users. This is not just cost optimization; it is also latency optimization. Users will not wait 3+ seconds for an autocomplete suggestion, but they will wait 30 seconds for a complex refactoring. The architectural lesson: build a routing layer that considers task complexity, latency SLA, and cost budget. The common mistake is starting with one frontier model and trying to make it work for everything, which yields a product that is too slow for simple tasks and too expensive to operate at scale. The routing decision should be explicit, measurable, and tunable, not implicit.
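A minimal sketch of such a routing layer, assuming three model tiers and a rule of "cheapest tier that meets the task's complexity, latency SLA, and cost budget". The tier names, latency figures, and per-call costs below are hypothetical placeholders for illustration, not measurements or real model identifiers, and this is not a description of how Cursor or GitHub Copilot implement routing.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    LOW = 1      # e.g. autocomplete, applying a diff
    MEDIUM = 2   # e.g. a scoped inline edit
    HIGH = 3     # e.g. multi-file agentic refactoring


@dataclass(frozen=True)
class ModelTier:
    name: str                   # hypothetical label, not a real model ID
    max_complexity: Complexity  # hardest task this tier is trusted with
    p95_latency_ms: int         # assumed latency; measure this in practice
    cost_per_call_usd: float    # assumed average cost per request


@dataclass(frozen=True)
class Task:
    kind: str
    complexity: Complexity
    latency_sla_ms: int
    cost_budget_usd: float


# Placeholder numbers; keeping them in one table makes the routing tunable.
TIERS = [
    ModelTier("small-fast", Complexity.LOW, 150, 0.0002),
    ModelTier("medium", Complexity.MEDIUM, 1500, 0.004),
    ModelTier("frontier", Complexity.HIGH, 20000, 0.08),
]


def route(task: Task, tiers=TIERS) -> ModelTier:
    """Return the cheapest tier that satisfies capability, latency, and cost."""
    candidates = [
        t for t in tiers
        if t.max_complexity.value >= task.complexity.value
        and t.p95_latency_ms <= task.latency_sla_ms
        and t.cost_per_call_usd <= task.cost_budget_usd
    ]
    if not candidates:
        # Fail loudly so unmet constraints are visible rather than implicit.
        raise LookupError(f"no tier satisfies constraints for task {task.kind!r}")
    return min(candidates, key=lambda t: t.cost_per_call_usd)


if __name__ == "__main__":
    for task in (
        Task("autocomplete", Complexity.LOW, 300, 0.001),
        Task("inline_edit", Complexity.MEDIUM, 3000, 0.01),
        Task("agent_refactor", Complexity.HIGH, 30000, 0.50),
    ):
        print(task.kind, "->", route(task).name)
```

Because the decision lives in one function over an explicit tier table, each routing outcome can be logged and the thresholds adjusted without touching feature code. A production router would likely also need fallbacks and live latency data, which this sketch omits.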
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:23:08.288433+00:00— report_created — created