Report #60696
[cost_intel] Why do reasoning models break IDE autocomplete despite high accuracy?
Keep IDE autocomplete latency at p50 < 50 ms and p99 < 200 ms by serving small code-specific models (e.g., Codestral 22B, CodeLlama 7B) with speculative decoding. Reasoning models add 3-10 s of latency, which breaks typing flow regardless of suggestion quality.
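A minimal sketch of how the two latency targets above could be checked against measured completion times. The 50 ms and 200 ms budgets come from this report; the function names, the nearest-rank percentile method, and the requirement that both percentiles be strictly under budget are assumptions made for illustration.

```python
import math

# Budgets quoted in this report; the rest of this block is illustrative.
P50_BUDGET_MS = 50.0
P99_BUDGET_MS = 200.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies in ms."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_autocomplete_budget(samples):
    """True when p50 and p99 are both strictly under the budgets above."""
    return (
        percentile(samples, 50) < P50_BUDGET_MS
        and percentile(samples, 99) < P99_BUDGET_MS
    )
```

With a real trace, `meets_autocomplete_budget(latencies_ms)` would be called on end-to-end times measured in the editor (keystroke to rendered suggestion), since network and rendering overhead count against the same budget.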
Journey Context:
IDE autocomplete has to keep pace with human typing cadence (80-120 wpm). Delays above 100 ms interrupt flow state. Reasoning models take 3-10 s per response, which makes them unusable for this UX pattern. The mistake is optimizing for accuracy (BLEU, pass@1) instead of latency. The correct architecture is a small model (3B-22B) with KV-cache persistence and speculative decoding, targeting a p50 under 50 ms. Reasoning models belong in 'design mode' (architecture decisions), not 'typing mode'.
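The cadence argument and the mode split can be sketched as follows. This is an illustration under stated assumptions: the 5-characters-per-word convention is the usual typing-test definition and is not stated in the report, and the tier names `small-code-model` and `reasoning-model` are hypothetical placeholders rather than real model identifiers.

```python
from enum import Enum

# Assumption: the common typing-test convention of 5 characters per word.
CHARS_PER_WORD = 5


def keystroke_interval_ms(wpm):
    """Average gap between keystrokes, in ms, at a given words-per-minute rate."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return 60_000.0 / (wpm * CHARS_PER_WORD)


def keystrokes_elapsed(latency_ms, wpm):
    """How many keystrokes a user types while waiting latency_ms for a suggestion."""
    return latency_ms / keystroke_interval_ms(wpm)


class Mode(Enum):
    TYPING = "typing"  # inline autocomplete, latency-bound
    DESIGN = "design"  # architecture decisions, accuracy-bound


# Hypothetical tier names, used only to show the routing decision.
MODEL_FOR_MODE = {
    Mode.TYPING: "small-code-model",
    Mode.DESIGN: "reasoning-model",
}


def pick_model(mode):
    """Route by interaction mode instead of by benchmark accuracy."""
    return MODEL_FOR_MODE[mode]
```

Under that convention, 80-120 wpm works out to 150-100 ms between keystrokes, so even the fastest 3 s reasoning response arrives 20 to 30 keystrokes after the context it was computed for.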
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:21:49.817200+00:00 - report_created - created