Report #51100
[cost_intel] Latency Cliff in Synchronous UX: When Reasoning Models Destroy User Retention
Never use o1/o3 in synchronous chat, live autocomplete, or real-time games. Use them only in async workflows (CI/CD, nightly batch jobs, or pre-computed caches). Target TTFT <500ms for chat, <100ms for autocomplete.
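A minimal sketch of checking a streamed response against the TTFT targets above. The names `measure_ttft`, `within_budget`, and the surface keys are hypothetical and not part of any SDK; the stream is assumed to be any iterable that yields tokens as they arrive.

```python
import time
from typing import Callable, Dict, Iterable

# TTFT budgets from this report, in seconds. Surface names are illustrative.
TTFT_BUDGET_S: Dict[str, float] = {"chat": 0.5, "autocomplete": 0.1}


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the seconds elapsed until the stream yields its first token."""
    start = clock()
    for _ in stream:
        # Stop at the first token; the rest of the stream is not consumed here.
        return clock() - start
    raise ValueError("stream produced no tokens")


def within_budget(surface: str, ttft_s: float) -> bool:
    """True if the measured TTFT is under the budget for this surface."""
    return ttft_s < TTFT_BUDGET_S[surface]
```

In practice the stream would be wrapped at the point where the model response is first iterated, and a budget miss logged per surface rather than raised.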
Journey Context:
o1 has a TTFT (Time to First Token) of 5-30 seconds versus <1s for GPT-4o. This violates the Doherty Threshold (400ms) for interactive systems; users perceive 10s delays as 'broken' regardless of answer quality. A common anti-pattern is adding o1 to a customer support chatbot: the latency destroys CSAT even if resolution accuracy rises. The architectural fix is strict async: serve real-time traffic with cheap models, and queue reasoning jobs whose results are delivered later via webhooks or email digests.
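The strict-async split could be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `Router`, `Job`, `needs_reasoning`, and `callback_url` are hypothetical names, the models are plain callables standing in for real API clients, and an in-process `queue.Queue` stands in for a durable job queue.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    prompt: str
    callback_url: str  # hypothetical webhook that receives the result later


class Router:
    """Answer synchronously with a cheap model; defer reasoning work."""

    def __init__(self, fast_model: Callable[[str], str]) -> None:
        self.fast_model = fast_model
        self.pending: "queue.Queue[Job]" = queue.Queue()

    def handle(self, prompt: str, needs_reasoning: bool,
               callback_url: str = "") -> str:
        # The user-facing path never waits on the reasoning model.
        if needs_reasoning:
            self.pending.put(Job(prompt, callback_url))
        return self.fast_model(prompt)

    def drain(self, slow_model: Callable[[str], str],
              deliver: Callable[[str, str], None]) -> int:
        """Run queued jobs off the request path; return how many were processed."""
        done = 0
        while True:
            try:
                job = self.pending.get_nowait()
            except queue.Empty:
                return done
            deliver(job.callback_url, slow_model(job.prompt))
            done += 1
```

`drain` is meant to be called from a background worker or scheduled job, so the reasoning model's latency never sits on the request path.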
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:15:41.380811+00:00 - report_created - created