Report #35388
[cost_intel] Unusable synchronous UX due to reasoning model latency cliff
Do not use o1/o3 for real-time chat, live autocomplete, or interactive coding assistants where Time-To-First-Token (TTFT) must be under 500 ms. Reserve reasoning models for async pipelines (CI checks, overnight batch jobs) and use 4o with speculative decoding for synchronous UX. The latency cliff is abrupt: o1 takes 10-30 s to respond, while 4o takes under 1 s.
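A minimal sketch of the routing rule above, assuming a single decision point in the application. The function name `pick_model`, the `ttft_budget_s` parameter, and the short model labels are hypothetical and chosen for illustration; real model identifiers depend on the provider and should be checked before use.

```python
from typing import Optional

# Short labels as used in this report; hypothetical, not verified API model IDs.
FAST_MODEL = "4o"
REASONING_MODEL = "o1"


def pick_model(ttft_budget_s: Optional[float], needs_reasoning: bool) -> str:
    """Choose a model label for a request.

    ttft_budget_s: interactive time-to-first-token budget in seconds,
                   or None for async work (CI checks, batch jobs).
    needs_reasoning: whether the task benefits from a reasoning model.
    """
    interactive = ttft_budget_s is not None
    if interactive:
        # Any interactive budget rules out the reasoning model,
        # whose latency the report puts at 10-30 s.
        return FAST_MODEL
    # Async path: latency is not user-visible, so reasoning is affordable.
    return REASONING_MODEL if needs_reasoning else FAST_MODEL
```

For example, `pick_model(0.5, needs_reasoning=True)` returns the fast model even though the task would benefit from reasoning, because the interactive budget takes priority; the same task submitted with `ttft_budget_s=None` is routed to the reasoning model.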
Journey Context:
Reasoning models generate 'thinking tokens' internally before emitting any visible output, and those tokens are not streamed to the client, so nothing appears until the reasoning phase finishes. Using o1 in a chat UI results in hangs of 15+ seconds that users perceive as crashes. The common mistake is assuming 'we'll add a spinner' is enough; abandonment rates spike after 3 s. Alternatives like o1-mini reduce latency to 3-5 s but still miss the <1 s UX threshold. The only viable synchronous use is serving pre-computed suggestions, not interactive generation.
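One way to check this claim for a given setup is to measure TTFT directly on whatever stream the client receives. The sketch below is provider-agnostic and assumes the response arrives as an iterable of text chunks. `measure_ttft` and `likely_abandoned` are hypothetical helper names, and the 3 s threshold is the figure quoted in this report, not a measured value.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

# Abandonment threshold quoted in this report (seconds); an assumption, not measured here.
ABANDON_AFTER_S = 3.0


def measure_ttft(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], str]:
    """Consume a stream of text chunks.

    Returns (ttft_seconds, full_text). ttft_seconds is None if the
    stream produced no non-empty chunk.
    """
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            # First visible token: everything before this is perceived as a hang.
            ttft = clock() - start
        parts.append(chunk)
    return ttft, "".join(parts)


def likely_abandoned(ttft_s: Optional[float]) -> bool:
    """True if the wait exceeded the threshold or nothing ever arrived."""
    return ttft_s is None or ttft_s > ABANDON_AFTER_S
```

The `clock` parameter is injectable so the helper can be exercised with a fake clock in tests; in production the default `time.monotonic` is the safer choice because it is unaffected by wall-clock adjustments.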
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:51:59.711782+00:00— report_created — created