Report #39281
[synthesis] High latency in AI coding assistants when making simple inline edits or completions
Decouple the agent architecture into two tiers: a fast, speculative inline model for immediate edits/completions, and a slower, high-reasoning model for multi-step planning and chat.
Journey Context:
Developers often route every agent task through a single powerful model (such as GPT-4), which makes simple typing interactions feel sluggish. Cursor's architecture reveals a split: 'Tab' completion uses a highly optimized, low-latency custom model (often distilled or fine-tuned) for sub-100ms responses, while 'Chat' uses a frontier model. The synthesis is that agent UX depends on predicting user intent (speculative execution) with a fast model, while reserving heavy compute for explicit, complex commands.
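The two-tier split described above could be sketched roughly as follows. This is a minimal illustration, not Cursor's actual implementation: the request kinds, the `TwoTierAgent` class, and the stand-in model functions are all hypothetical names invented for the example, and a real system would call actual model endpoints instead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Tier(Enum):
    FAST = "fast"  # low-latency inline model (completions, small edits)
    SLOW = "slow"  # high-reasoning model (multi-step planning, chat)


@dataclass(frozen=True)
class Request:
    kind: str    # hypothetical labels: "completion", "inline_edit", "chat", "plan"
    prompt: str


# Assumption: these request kinds are latency-sensitive and simple enough
# for the fast tier. Anything else falls through to the slow tier.
FAST_KINDS = frozenset({"completion", "inline_edit"})


def choose_tier(request: Request) -> Tier:
    """Pick the tier based only on the kind of request."""
    return Tier.FAST if request.kind in FAST_KINDS else Tier.SLOW


class TwoTierAgent:
    """Dispatches each request to one of two injected model callables."""

    def __init__(
        self,
        fast_model: Callable[[str], str],
        slow_model: Callable[[str], str],
    ) -> None:
        self._models: Dict[Tier, Callable[[str], str]] = {
            Tier.FAST: fast_model,
            Tier.SLOW: slow_model,
        }

    def handle(self, request: Request) -> Tuple[Tier, str]:
        tier = choose_tier(request)
        return tier, self._models[tier](request.prompt)


# Stand-in models for illustration only; they just tag the prompt.
def fake_fast_model(prompt: str) -> str:
    return "[fast] " + prompt


def fake_slow_model(prompt: str) -> str:
    return "[slow] " + prompt


agent = TwoTierAgent(fake_fast_model, fake_slow_model)
tab_result = agent.handle(Request("completion", "def add(a, b):"))
chat_result = agent.handle(Request("chat", "Refactor this module"))
```

Routing on an explicit request kind keeps the decision itself off the latency-critical path: the fast tier is never blocked waiting on the slow one, and unknown kinds default to the more capable model.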
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:24:25.728875+00:00— report_created — created