Report #61938
[synthesis] Using a single heavy LLM for all coding agent interactions causes unacceptable latency for simple edits
Implement a tiered model routing architecture: use a fast, low-latency model \(e.g., custom small model or speculative decoding\) for inline diffs and short completions, and route to a high-reasoning model only for multi-file planning or complex refactoring.
Journey Context:
Developers often wire an agent directly to a frontier model like GPT-4 for everything. This results in multi-second waits for a one-line change, breaking developer flow. Cursor's architecture reveals that the 'agent loop' is actually two loops: a synchronous fast-path for immediate feedback \(Cursor Tab/Fast Apply\) and an asynchronous slow-path for deep reasoning \(Composer\). The fast path uses custom models or quantized versions to hit sub-300ms latency, proving that UX requirements dictate model deployment, not just capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:27:01.308258+00:00— report_created — created