Report #48107
[synthesis] Why is using a single powerful LLM (like GPT-4) for all coding assistant features too slow and expensive?
Split the architecture into two paths: a fast path that uses a small, low-latency model trained specifically for Fill-In-the-Middle (FIM) to serve inline autocomplete (under 200 ms), and a slow path that uses a frontier model for multi-file chat and agentic edits. Route each interaction by its latency tolerance.
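A minimal sketch of that routing decision, in Python. The model names, the slow-path budget, and the FIM sentinel strings are hypothetical placeholders, not values from any real product; only the 200 ms fast-path target comes from this report. Real FIM-trained models define their own sentinel tokens, so check the model's documentation before reusing the prompt layout.

```python
from dataclasses import dataclass
from enum import Enum


class Interaction(Enum):
    INLINE_COMPLETION = "inline_completion"
    CHAT = "chat"
    AGENTIC_EDIT = "agentic_edit"


@dataclass(frozen=True)
class Route:
    model: str              # hypothetical model identifier
    latency_budget_ms: int  # how long the user is assumed to tolerate waiting
    prompt_style: str       # "fim" or "chat"


# Assumed values for illustration; only the 200 ms target is from the report.
FAST_PATH = Route(model="small-fim-model", latency_budget_ms=200, prompt_style="fim")
SLOW_PATH = Route(model="frontier-model", latency_budget_ms=10_000, prompt_style="chat")


def route(interaction: Interaction) -> Route:
    """Send latency-sensitive inline completions to the fast path, the rest to the slow path."""
    if interaction is Interaction.INLINE_COMPLETION:
        return FAST_PATH
    return SLOW_PATH


def build_fim_prompt(prefix: str, suffix: str,
                     pre: str = "<PRE>", suf: str = "<SUF>", mid: str = "<MID>") -> str:
    """Lay out code before and after the cursor in prefix-suffix-middle order.

    The sentinel strings are placeholders; substitute the tokens your model was trained with.
    """
    return f"{pre}{prefix}{suf}{suffix}{mid}"


if __name__ == "__main__":
    chosen = route(Interaction.INLINE_COMPLETION)
    print(chosen.model, chosen.latency_budget_ms)
    print(build_fim_prompt("def add(a, b):\n    return ", "\n"))
```

Keeping the route as data (model, budget, prompt style) makes it explicit that each path needs its own prompt strategy as well as its own model.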
Journey Context:
A single frontier model cannot meet the sub-second latency that inline autocomplete needs while you type. If GPT-4 takes 2 seconds to suggest the next line, the user will already have typed it. Products like Cursor and Copilot use a specialized small model for fast-path autocomplete (often running locally or on optimized inference infrastructure) and reserve the heavy frontier model for the chat sidebar, where users tolerate a few seconds of latency. The tradeoff is maintaining two model pipelines and two prompt strategies.
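To make the "user will already have typed it" problem concrete, here is a small Python sketch of a staleness check a client could run when a completion finally arrives. The function name and its behavior are assumptions for illustration, not a description of how Cursor or Copilot handle this.

```python
from typing import Optional


def remaining_suggestion(text_at_request: str, text_now: str, suggestion: str) -> Optional[str]:
    """Return the part of `suggestion` that is still useful, or None if it is stale.

    text_at_request: buffer contents up to the cursor when the request was sent
    text_now:        buffer contents up to the cursor when the response arrived
    """
    if not text_now.startswith(text_at_request):
        return None  # the user deleted or rewrote earlier text
    typed_since = text_now[len(text_at_request):]
    if not suggestion.startswith(typed_since):
        return None  # the user typed something different from the suggestion
    rest = suggestion[len(typed_since):]
    return rest or None  # nothing left to offer if the user typed it all


if __name__ == "__main__":
    before = "def add(a, b):\n    "
    print(remaining_suggestion(before, before + "ret", "return a + b"))
```

The slower the model, the more often this check returns None, which is why the latency budget matters more for inline completion than for chat.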
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:13:53.964320+00:00— report_created — created