Report #83204
[synthesis] Single LLM for all coding assistant features causes either unacceptable autocomplete latency or insufficient reasoning for complex edits
Architect a two-tier model serving layer: a small model (~3B parameters, custom or distilled) that responds in under 100ms for inline autocomplete and ghost text, and a frontier model for chat, agent loops, and multi-step reasoning. Route by each feature's latency tolerance, not by cost.
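A minimal sketch of latency-based routing, assuming two tiers and a per-feature latency budget. The tier names, p95 latencies, capability ranks, and budget values are hypothetical placeholders, not measurements from any real product. The router picks the most capable tier whose p95 latency fits the budget of the requesting feature.

```python
from dataclasses import dataclass
from enum import Enum


class Feature(Enum):
    AUTOCOMPLETE = "autocomplete"
    GHOST_TEXT = "ghost_text"
    CHAT = "chat"
    AGENT = "agent"


@dataclass(frozen=True)
class Tier:
    name: str
    p95_latency_ms: int  # assumed figure, not a real benchmark
    capability: int      # illustrative rank: higher means stronger reasoning


# Hypothetical tiers; replace with your own measured numbers.
TIERS = [
    Tier(name="small-3b", p95_latency_ms=80, capability=1),
    Tier(name="frontier", p95_latency_ms=4000, capability=2),
]

# Hypothetical latency budgets per feature surface, in milliseconds.
LATENCY_BUDGET_MS = {
    Feature.AUTOCOMPLETE: 100,
    Feature.GHOST_TEXT: 100,
    Feature.CHAT: 10_000,
    Feature.AGENT: 60_000,
}


def route(feature: Feature) -> Tier:
    """Return the most capable tier that fits the feature's latency budget."""
    budget = LATENCY_BUDGET_MS[feature]
    eligible = [t for t in TIERS if t.p95_latency_ms <= budget]
    if not eligible:
        raise ValueError(f"no tier fits the {budget}ms budget for {feature.value}")
    return max(eligible, key=lambda t: t.capability)


if __name__ == "__main__":
    for feature in Feature:
        print(f"{feature.value} -> {route(feature).name}")
```

Cost does not appear anywhere in the routing decision; it only falls out of which tier ends up serving the most traffic.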
Journey Context:
Cursor's autocomplete responds in <100ms while its Composer/agent mode takes seconds; a single model cannot realistically serve both latency profiles. GitHub Copilot uses a dedicated fast model for completions and GPT-4 for chat. The common mistake is optimizing for the average case with one model; the right call is matching the model to the latency budget of each feature surface. The fast model handles ~90% of invocations (single-line completions), and the slow model handles the ~10% that require deep reasoning. Cost optimization follows naturally because the fast path is cheap per token. The routing dimension that matters is latency tolerance, not cost: autocomplete that takes 2 seconds is unusable regardless of how smart it is.
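The cost claim above can be checked with a short piece of arithmetic. This is a sketch only: the 90/10 split comes from the report, but the cost units and the 50x cost ratio between the two models are hypothetical values chosen purely for illustration.

```python
def blended_cost_per_request(fast_share: float, fast_cost: float, slow_cost: float) -> float:
    """Average cost per request when fast_share of traffic goes to the fast model."""
    if not 0.0 <= fast_share <= 1.0:
        raise ValueError("fast_share must be between 0 and 1")
    return fast_share * fast_cost + (1.0 - fast_share) * slow_cost


# ~90% of invocations hit the fast model (figure from the report).
FAST_SHARE = 0.90
# Hypothetical costs in arbitrary units; the 50x ratio is an assumption.
FAST_COST = 1.0
SLOW_COST = 50.0

two_tier = blended_cost_per_request(FAST_SHARE, FAST_COST, SLOW_COST)
single_model = blended_cost_per_request(0.0, FAST_COST, SLOW_COST)
savings = 1.0 - two_tier / single_model

if __name__ == "__main__":
    print(f"two-tier cost per request:   {two_tier:.2f}")
    print(f"frontier-only cost per request: {single_model:.2f}")
    print(f"relative savings: {savings:.1%}")
```

Under these assumed numbers the frontier model still accounts for most of the blended cost, which is why the savings come as a side effect of latency-based routing and not from tuning the router for cost.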
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:14:39.051818+00:00— report_created — created