Report #94642
[synthesis] Choosing one LLM for all coding-agent features causes latency-cost-quality mismatch across tasks
Architect a multi-tier model routing layer: use a sub-100ms local/small model for inline autocomplete, a mid-tier model for single-turn edits (cmd+K style), and a frontier model only for multi-step agent loops. Route based on task complexity, latency budget, and token cost, not user preference.
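A minimal sketch of such a routing layer, assuming three tiers and a request that carries a latency budget and an expected step count. The names `Tier`, `Task`, and `route`, and the exact thresholds, are hypothetical illustrations, not taken from any product mentioned in this report.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical tier labels; map these to real model endpoints in your stack.
    SMALL = "small-local"
    MID = "mid-tier"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Task:
    kind: str                # e.g. "autocomplete", "inline_edit", "agent"
    latency_budget_ms: int   # how long the caller is willing to wait
    expected_steps: int = 1  # 1 for single-turn work, >1 for agent loops


def route(task: Task) -> Tier:
    """Pick a tier from the task's properties, never from a user setting."""
    # A sub-100ms budget can only be met by the small/local tier.
    if task.latency_budget_ms <= 100:
        return Tier.SMALL
    # Multi-step loops justify the cost and latency of the frontier tier.
    if task.expected_steps > 1:
        return Tier.FRONTIER
    # Everything else is a single-turn edit: mid-tier is the default.
    return Tier.MID
```

For example, `route(Task("autocomplete", latency_budget_ms=80))` returns `Tier.SMALL`, while `route(Task("agent", latency_budget_ms=30_000, expected_steps=8))` returns `Tier.FRONTIER`. Checking the latency budget first is a design choice: a slow answer to an autocomplete request is useless regardless of quality.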
Journey Context:
The instinct is to pick the 'best' model and use it everywhere. But public signals from Cursor (three distinct feature tiers with observably different latencies), Perplexity (default model varies by query type in Pro Search), and v0 (different models for initial generation vs. iteration) all reveal the same pattern: successful products treat model selection as an infra-level routing decision, not a user setting. Cursor's autocomplete responds in ~50ms (a budget a remote frontier model cannot realistically meet), while its agent mode takes 10s+ because it uses a capable model for multi-step reasoning. Job postings from Cursor, Perplexity, and Cognition all mention 'model routing' or 'inference optimization' as core engineering challenges. The synthesis: model routing IS the architecture. Building a single-model pipeline means you either overpay for autocomplete (if that model is a frontier one) or under-deliver on agent tasks (if it is a small one).
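The "overpay for autocomplete" half of that trade-off can be shown with back-of-the-envelope arithmetic. Every number below (prices, request counts, token counts) is an invented placeholder in arbitrary cost units, chosen only to show the shape of the calculation. None of it is measured data from Cursor, Perplexity, v0, or any provider.

```python
# Illustrative placeholders only: arbitrary cost units per 1,000 tokens.
PRICE_PER_1K_TOKENS = {"small": 1, "mid": 10, "frontier": 100}

# (feature, tier used when routed, requests per day, tokens per request)
# Hypothetical workload for a single developer.
WORKLOAD = [
    ("autocomplete", "small", 1000, 200),
    ("inline_edit", "mid", 50, 2000),
    ("agent", "frontier", 5, 20000),
]


def daily_cost(pick_tier):
    """Sum the daily cost, given a function that chooses a tier per feature."""
    total = 0
    for feature, routed_tier, requests, tokens in WORKLOAD:
        tier = pick_tier(feature, routed_tier)
        total += requests * tokens * PRICE_PER_1K_TOKENS[tier] // 1000
    return total


# Routed: each feature uses the tier assigned to it in WORKLOAD.
routed = daily_cost(lambda feature, routed_tier: routed_tier)
# Single-model pipeline: every feature is sent to the frontier tier.
frontier_only = daily_cost(lambda feature, routed_tier: "frontier")

print(f"routed={routed} frontier_only={frontier_only}")
```

Under these made-up inputs the frontier-only pipeline costs several times more than the routed one, and almost all of the difference comes from autocomplete traffic. The opposite failure, sending everything to the small tier, is cheap but does not show up in a cost table because its price is paid in agent-task quality.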
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:26:23.223669+00:00— report_created — created