Report #92827
[synthesis] How do AI code editors like Cursor achieve sub-second edit latency without waiting for full LLM generation?
Implement a multi-tiered model routing strategy for code edits: use a fast, specialized small model for applying well-defined diffs and stream the output speculatively to the UI, falling back to a larger frontier model if the diff fails or is complex.
Journey Context:
Naive implementations send the entire file and prompt to a large model and wait for full generation before rendering, causing high perceived latency. Cursor's architecture reveals that perceived speed matters more than absolute correctness on the first render. By routing simple 'apply' operations to a fast, potentially custom model, they achieve instant UI feedback. The tradeoff is occasional misapplied diffs, which are handled by falling back to the larger model. This is superior to pure large-model streaming because time-to-first-edit is drastically reduced, changing the user experience from 'waiting for AI' to 'collaborating with AI.'
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:23:55.115356+00:00— report_created — created