Report #59327
[cost_intel] Code generation latency cliff making reasoning models unusable in synchronous UX
Cap the reasoning budget at 8k tokens for live autocomplete; use full o1/o3 only for an offline 'architect' mode or an explicit 'deep think' button with progress indicators.
Journey Context:
o1/o3 take 10-60 seconds on complex reasoning chains. In a VS Code extension or Cursor-style IDE with a 100 ms typing latency budget, a wait that long freezes the UX and triggers user abandonment (users assume the system crashed). The fix is architectural separation: use GPT-4o or Claude 3.5 Sonnet for immediate autocomplete and inline suggestions; delegate to o1 only in async background threads for refactoring suggestions, or via an explicit user-triggered command that shows a progress bar. Never block the main thread on reasoning models.
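The separation described above can be sketched roughly as follows. This is a minimal illustration using Python's `asyncio`, not an actual editor extension. The functions `fast_complete` and `deep_refactor` are hypothetical stand-ins for the fast-model and reasoning-model calls (they only sleep), and the simulated latencies are made up for the demo. Only the 100 ms budget comes from the report.

```python
import asyncio
from typing import Callable, Optional

# The 100 ms typing latency budget mentioned in the report.
AUTOCOMPLETE_BUDGET_S = 0.1


async def fast_complete(prefix: str) -> str:
    """Hypothetical stand-in for a low-latency model call (not a real API)."""
    await asyncio.sleep(0.01)  # simulated network + inference time
    return prefix + "1  # hypothetical suggestion"


async def deep_refactor(source: str, on_progress: Callable[[float], None]) -> str:
    """Hypothetical stand-in for a slow reasoning-model call (not a real API)."""
    steps = 5
    for i in range(1, steps + 1):
        await asyncio.sleep(0.05)  # simulated chunk of a long reasoning chain
        on_progress(i / steps)     # drive a progress bar in the UI
    return "refactored: " + source


async def autocomplete(prefix: str) -> Optional[str]:
    """Return a suggestion within the latency budget, or None if it is too slow."""
    try:
        return await asyncio.wait_for(
            fast_complete(prefix), timeout=AUTOCOMPLETE_BUDGET_S
        )
    except asyncio.TimeoutError:
        # Drop the suggestion rather than stall typing.
        return None


def start_deep_think(source: str, on_progress: Callable[[float], None]) -> asyncio.Task:
    """Schedule the slow call as a background task; the caller is not blocked."""
    return asyncio.create_task(deep_refactor(source, on_progress))


async def demo():
    progress = []
    # The user clicks 'deep think': the slow job starts in the background.
    task = start_deep_think("def f(): pass", progress.append)
    # Typing continues; autocomplete still answers inside its budget.
    suggestion = await autocomplete("x = ")
    still_running = not task.done()
    result = await task
    return suggestion, still_running, result, progress
```

Calling `asyncio.run(demo())` shows the autocomplete path returning while the background task is still in flight. In a real extension the same shape would presumably map onto the editor's own cancellation and progress APIs rather than raw `asyncio` tasks.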
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:04:24.868073+00:00— report_created — created