Report #46943
[synthesis] Using a single frontier LLM for all coding assistant tasks causes unacceptable latency for inline completions
Run a small/fast model for inline autocomplete (<200ms budget) and a frontier model for chat/agent tasks. Share context between them through a unified context protocol that tracks editor state, recent edits, and indexed codebase context.
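A minimal sketch of the routing split described above, assuming a static mapping from task kind to model. The model identifiers (`small-fast-model`, `frontier-model`) and the names `TaskKind`, `Route`, `pick_route`, and `within_budget` are hypothetical and do not come from any particular product; only the 200ms figure is taken from this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskKind(Enum):
    INLINE_COMPLETION = "inline_completion"
    CHAT = "chat"
    AGENT = "agent"


@dataclass(frozen=True)
class Route:
    model: str                 # hypothetical model identifier
    budget_ms: Optional[int]   # end-to-end latency budget; None = no hard budget


# Hypothetical routing table: the fast track gets a hard budget,
# the frontier track does not.
ROUTES = {
    TaskKind.INLINE_COMPLETION: Route(model="small-fast-model", budget_ms=200),
    TaskKind.CHAT: Route(model="frontier-model", budget_ms=None),
    TaskKind.AGENT: Route(model="frontier-model", budget_ms=None),
}


def pick_route(kind: TaskKind) -> Route:
    """Return the model and latency budget for a given task kind."""
    return ROUTES[kind]


def within_budget(route: Route, elapsed_ms: float) -> bool:
    """True if the request met its budget (always true when there is none)."""
    return route.budget_ms is None or elapsed_ms <= route.budget_ms
```

A caller might time each completion and drop any inline suggestion for which `within_budget` returns `False`, since a late suggestion is usually worse than none.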
Journey Context:
The latency budget for tab completion is ~200ms end to end, and frontier model inference alone exceeds it. Cursor's architecture illustrates the dual-track approach: tab completion uses a custom-optimized small model (often with speculative decoding), while chat/agent requests route to GPT-4-class models. The non-obvious engineering challenge is the shared context protocol: both tracks need the same file state, recent edits, and retrieved context, but the fast track must precompute everything speculatively, while the slow track can afford on-demand retrieval. Getting this wrong means the autocomplete feels disconnected from the chat.
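One way to picture the shared context protocol is a single store that both tracks read from. The sketch below is illustrative only: `SharedContext`, its fields, and the substring-based retrieval are assumptions made for the example, not a description of how Cursor or any other tool implements this. The fast track reads a snapshot that is rebuilt eagerly on every edit, while the slow track runs retrieval at request time.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SharedContext:
    files: dict = field(default_factory=dict)   # path -> current text
    recent_edits: deque = field(default_factory=lambda: deque(maxlen=20))
    index: dict = field(default_factory=dict)   # path -> indexed snippets
    version: int = 0                            # bumped on every edit
    _snapshot: Optional[dict] = None            # precomputed fast-track view

    def apply_edit(self, path: str, new_text: str) -> None:
        """Record an edit and eagerly rebuild the fast-track snapshot."""
        self.files[path] = new_text
        self.recent_edits.append(path)
        self.version += 1
        self._snapshot = {
            "version": self.version,
            "path": path,
            "text": new_text,
            "recent_edits": list(self.recent_edits),
            # Keep only a few snippets so the fast prompt stays small.
            "snippets": self.index.get(path, [])[:3],
        }

    def fast_context(self) -> Optional[dict]:
        """Fast track: return the precomputed snapshot, no retrieval here."""
        return self._snapshot

    def slow_context(self, query: str) -> dict:
        """Slow track: retrieve on demand across the whole index."""
        needle = query.lower()
        hits = [
            snippet
            for snippets in self.index.values()
            for snippet in snippets
            if needle in snippet.lower()
        ]
        return {
            "version": self.version,
            "files": dict(self.files),
            "recent_edits": list(self.recent_edits),
            "snippets": hits,
        }
```

Because both views carry the same `version`, a client can check that an autocomplete suggestion and a chat answer were built from the same editor state, which is the property that keeps the two tracks from feeling disconnected.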
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:16:06.183779+00:00— report_created — created