Report #90168
[synthesis] Using a single large model for all AI coding tasks regardless of complexity or latency requirements
Implement two-tier model routing: a fast, small model for high-confidence, low-latency tasks (autocomplete, single-line suggestions, simple completions) and a large reasoning model for complex tasks (multi-step agent actions, architecture decisions, debugging). Route based on task type and required latency, not just cost.
Journey Context:
The naive approach of using the best model for everything fails for three reasons: (1) autocomplete at >200ms latency feels broken to users, (2) cost scales linearly with usage while trivial predictions gain no quality, and (3) the 'obvious next token' case (low-entropy predictions) doesn't need a reasoning model. Cursor's architecture illustrates the pattern: its Tab completion uses a fast, custom fine-tuned model that delivers suggestions in <100ms, while Chat and Agent modes use GPT-4/Claude for deep reasoning. GitHub Copilot similarly routes between models for quick suggestions, multi-line completions, and chat. The critical insight from cross-product analysis is that this isn't just cost optimization: the fast path and the slow path are architecturally different. The fast path is a completion model (predict the next token given a prefix). The slow path is an instruction-following model (reason about what code should exist given a goal). Conflating them leads to both slow completions and shallow reasoning. The routing heuristic should itself be simple and fast, relying on three signals: task type (autocomplete vs. chat), context size, and explicit user intent.
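A minimal sketch of such a routing heuristic, in Python, is given here for illustration only. The names `TaskType`, `Tier`, `Request`, `route`, and the `FAST_PATH_MAX_CONTEXT_TOKENS` threshold are hypothetical and do not come from Cursor, Copilot, or any real product; the 2,000-token cutoff is an arbitrary placeholder, not a measured value.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    AUTOCOMPLETE = "autocomplete"
    CHAT = "chat"
    AGENT = "agent"


class Tier(Enum):
    FAST = "fast"            # small completion model, latency-sensitive
    REASONING = "reasoning"  # large instruction-following model


@dataclass(frozen=True)
class Request:
    task_type: TaskType
    context_tokens: int
    user_requested_reasoning: bool = False  # explicit user intent signal


# Assumption: placeholder threshold; tune against your own latency data.
FAST_PATH_MAX_CONTEXT_TOKENS = 2_000


def route(request: Request) -> Tier:
    """Pick a model tier using only cheap checks, so routing adds no latency."""
    # Explicit user intent overrides everything else.
    if request.user_requested_reasoning:
        return Tier.REASONING
    # Chat and agent tasks need goal-directed reasoning.
    if request.task_type is not TaskType.AUTOCOMPLETE:
        return Tier.REASONING
    # Oversized context is treated as a sign the task is not a trivial completion.
    if request.context_tokens > FAST_PATH_MAX_CONTEXT_TOKENS:
        return Tier.REASONING
    return Tier.FAST
```

For example, `route(Request(TaskType.AUTOCOMPLETE, 300))` selects the fast tier, while any chat or agent request selects the reasoning tier. Sending oversized autocomplete requests to the slow path is one possible design choice; a system that must hold the latency budget might instead truncate the context and stay on the fast path.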
Lifecycle
2026-06-22T09:56:36.696064+00:00— report_created — created