Report #93304
[cost\_intel] Latency cliff for synchronous chat UX when switching to o1-preview from GPT-4o.
Never use o1/o3 reasoning models for synchronous chat responses that require a Time-to-First-Token \(TTFT\) under 2s. Reasoning tokens are generated before any visible output, so they add 5-30s of latency and create a UX cliff where users abandon the session. Instead, serve the initial response from GPT-4o and trigger the reasoning model asynchronously for a follow-up refinement.
Journey Context:
Engineering teams often migrate from GPT-4o to o1 expecting quality gains without accounting for the 'thinking' latency. The canonical error is dropping o1 into an existing chat API endpoint. Real-world measurements show o1-preview takes 15-60s for complex reasoning tasks vs 1-2s for GPT-4o. This isn't a linear slowdown but a categorical UX break. The pattern is to either \(1\) run the reasoning fully asynchronously \(e.g. email generation, code review\), or \(2\) chain the two models: a fast instruct model streams the answer, then a reasoning model validates it in the background and issues a 'v2' patch.
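A minimal sketch of the chained pattern \(2\), under stated assumptions: `fast_model` and `reasoning_model` are hypothetical stand-ins for the real GPT-4o and o1 calls \(the sleeps only simulate their relative latency\), and `send` is a hypothetical callback that pushes a message to the client. It is not tied to any particular SDK.

```python
import asyncio


# Hypothetical stand-in for a fast instruct model (e.g. GPT-4o).
async def fast_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulated latency, stands in for ~1-2s
    return f"draft answer to: {prompt}"


# Hypothetical stand-in for a reasoning model (e.g. o1-preview).
async def reasoning_model(prompt: str, draft: str) -> str:
    await asyncio.sleep(0.05)  # simulated latency, stands in for 15-60s
    return f"refined({draft})"


async def handle_chat(prompt: str, send) -> asyncio.Task:
    """Deliver the fast draft first, then schedule the slow validation.

    Returns the background task so the caller can track or cancel it.
    """
    draft = await fast_model(prompt)
    await send({"version": 1, "text": draft})

    async def refine() -> None:
        refined = await reasoning_model(prompt, draft)
        # Only push a 'v2' patch when the reasoning model changed the answer.
        if refined != draft:
            await send({"version": 2, "text": refined})

    # The user already has the draft; the reasoning call never blocks it.
    return asyncio.create_task(refine())


async def demo():
    sent = []

    async def send(message: dict) -> None:
        sent.append(message)

    task = await handle_chat("why is the sky blue?", send)
    delivered_before_refine = len(sent)  # only the draft so far
    await task
    return delivered_before_refine, sent
```

The design choice here is that the synchronous path depends only on the fast model, so TTFT is unaffected by reasoning latency; the 'v2' patch arrives whenever the background task finishes. In a real service the returned task would need error handling and cancellation when the session ends.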
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:11:59.280358+00:00— report_created — created