Report #50351
[cost_intel] Latency cliff in synchronous UX: when does reasoning model TTFT (time-to-first-token) kill conversion?
Avoid reasoning models for any user-facing turn that must respond in under 5 s. Instead, stream a fast instruct model (GPT-4o-mini/Sonnet 3.5) immediately while evaluating in the background whether the query needs deep reasoning, then swap in or upgrade to a reasoning model only if confidence in the fast model's answer is below 0.7.
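A minimal sketch of this flow, assuming an asyncio-based service. The three model functions are hypothetical stubs standing in for real provider calls, and the confidence heuristic is a placeholder; only the 0.7 threshold comes from this report.

```python
import asyncio
from typing import AsyncIterator, Callable

CONFIDENCE_THRESHOLD = 0.7  # threshold from the report; escalate below it


async def fast_model_stream(query: str) -> AsyncIterator[str]:
    """Hypothetical stub for a streaming call to a fast instruct model."""
    for token in ["Draft ", "answer ", "to: ", query]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield token


async def estimate_confidence(query: str) -> float:
    """Hypothetical background evaluator returning a confidence in [0, 1]."""
    await asyncio.sleep(0)
    # Placeholder heuristic (assumption): proof-style queries need reasoning.
    return 0.4 if "prove" in query.lower() else 0.9


async def reasoning_model(query: str) -> str:
    """Hypothetical stub for a slow, non-streaming reasoning-model call."""
    await asyncio.sleep(0)
    return f"Reasoned answer to: {query}"


async def answer(query: str, emit: Callable[[str], None]) -> str:
    """Stream the fast answer at once; upgrade only if confidence is low."""
    # Start the evaluation first so it runs while tokens are streaming.
    evaluation = asyncio.create_task(estimate_confidence(query))
    async for token in fast_model_stream(query):
        emit(token)
    confidence = await evaluation
    if confidence < CONFIDENCE_THRESHOLD:
        emit("\n[upgraded] " + await reasoning_model(query))
        return "upgraded"
    return "fast"


if __name__ == "__main__":
    route = asyncio.run(answer("Prove this loop terminates", print))
    print("route:", route)
```

The user sees the first fast-model token without waiting on the evaluator, so the slow path costs latency only on the turns that are escalated.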
Journey Context:
Reasoning models (o1-preview, o3-mini) have a 10-30 s time-to-first-token because they generate a chain of thought before emitting any output. UX research shows that latency above 5 s drops engagement by 50% or more. Synchronous chat (customer support, coding assistants) cannot tolerate this. Pattern: 'optimistic rendering' with a cheap model, plus a 'retrofit' (replacing the answer with a reasoning model's output) if the cheap model's logprob entropy is high, or an async webhook for heavy reasoning.
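The entropy check behind the 'retrofit' decision could be sketched as follows, assuming the cheap model's API returns top-k log-probabilities for each generated token. The function names and the 1.0-nat threshold are illustrative assumptions, not values from this report, and would need tuning on real traffic.

```python
import math
from typing import Sequence

ENTROPY_THRESHOLD = 1.0  # hypothetical cutoff in nats; tune on real traffic


def token_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one token's renormalized top-k distribution."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalize: top-k mass rarely sums to exactly 1
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def mean_entropy(per_token_top_logprobs: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over a whole response."""
    if not per_token_top_logprobs:
        return 0.0
    total = sum(token_entropy(t) for t in per_token_top_logprobs)
    return total / len(per_token_top_logprobs)


def should_retrofit(
    per_token_top_logprobs: Sequence[Sequence[float]],
    threshold: float = ENTROPY_THRESHOLD,
) -> bool:
    """True when the cheap model looks uncertain enough to escalate."""
    return mean_entropy(per_token_top_logprobs) > threshold


if __name__ == "__main__":
    confident = [[math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]]
    uncertain = [[math.log(0.25)] * 4]
    print(should_retrofit(confident), should_retrofit(uncertain))
```

Mean entropy is only one possible aggregate; a maximum or a high percentile over tokens would be more sensitive to a single uncertain span in an otherwise confident answer.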
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:59:44.141030+00:00— report_created — created