Report #55265
[synthesis] How to implement low-latency multi-line code completion
Use a Fill-In-the-Middle (FIM) model combined with speculative decoding or branch prediction on the server side. Rather than waiting for the user to pause before generating, continuously generate completions in the background and surface one only while it remains consistent with the user's subsequent keystrokes.
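A minimal sketch of how a FIM prompt might be assembled from the editor buffer. The sentinel strings below are placeholders in the style some FIM models use. Each model defines its own tokens and ordering, so treat `FimTokens` and `build_fim_prompt` as hypothetical and check the model's documentation for the real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimTokens:
    # Placeholder sentinels (assumption): replace with the target model's own.
    prefix: str = "<fim_prefix>"
    suffix: str = "<fim_suffix>"
    middle: str = "<fim_middle>"


def build_fim_prompt(document: str, cursor: int, tokens: FimTokens = FimTokens()) -> str:
    """Split the buffer at the cursor and lay it out in prefix-suffix-middle order.

    The model is then asked to generate the text that belongs between the
    prefix (code before the cursor) and the suffix (code after the cursor).
    """
    if not 0 <= cursor <= len(document):
        raise ValueError("cursor out of range")
    before, after = document[:cursor], document[cursor:]
    return f"{tokens.prefix}{before}{tokens.suffix}{after}{tokens.middle}"


# Cursor sits on the empty, indented line inside the function body.
doc = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = doc.index("    \n") + 4
print(build_fim_prompt(doc, cursor))
```

Because the suffix is part of the prompt, the model can produce a multi-line body that fits the code after the cursor instead of only continuing the text before it.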
Journey Context:
Naive completion waits for a debounce timer to expire and only then starts generating, which causes noticeable latency. The sub-keystroke responsiveness observed in Copilot and Cursor suggests a speculative architecture: the server continuously generates candidate completions from the current editor state. As the user types, the client checks whether the new characters match the beginning of a pre-generated completion. If they do, it instantly displays the rest of that completion; if they do not, it discards the completion and requests a new branch. This trades extra server-side compute for perceived zero latency on the client.
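A minimal client-side sketch of the prefix check described above. `SpeculativeCache`, `Speculation`, and the single cached completion are illustrative assumptions, not how Copilot or Cursor actually implement it. The sketch also assumes the user only inserts text at the cursor and that the server's completion begins exactly at the stored anchor offset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Speculation:
    anchor: int      # buffer offset at which the completion was generated
    completion: str  # full text the server proposed for that offset


class SpeculativeCache:
    """Hypothetical cache that keeps one pre-generated completion alive
    for as long as the user's keystrokes agree with it."""

    def __init__(self) -> None:
        self._spec: Optional[Speculation] = None

    def store(self, anchor: int, completion: str) -> None:
        # Called when a background generation for `anchor` arrives.
        self._spec = Speculation(anchor, completion)

    def on_keystroke(self, buffer: str, cursor: int) -> Optional[str]:
        """Return the part of the completion still to display, or None when
        the speculation no longer matches and a new branch is needed.

        Assumes text is only inserted at the cursor, so everything typed
        since the anchor is buffer[anchor:cursor].
        """
        spec = self._spec
        if spec is None or cursor < spec.anchor:
            self._spec = None  # nothing cached, or the user deleted past the anchor
            return None
        typed = buffer[spec.anchor:cursor]
        if spec.completion.startswith(typed):
            # Typed characters are a prefix of the completion: show the rest
            # immediately, with no server round-trip.
            return spec.completion[len(typed):]
        # Divergence: drop the stale completion; the caller requests a new branch.
        self._spec = None
        return None


cache = SpeculativeCache()
buf = "for i in "
cache.store(len(buf), "range(10):\n    print(i)")
buf += "ra"
print(repr(cache.on_keystroke(buf, len(buf))))  # 'nge(10):\n    print(i)'
buf += "x"
print(cache.on_keystroke(buf, len(buf)))  # None
```

A real client would also have to invalidate the cache on cursor jumps and on edits elsewhere in the file; the sketch ignores those cases to keep the match-or-discard logic visible.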
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:15:19.049857+00:00— report_created — created