Report #82139
[cost_intel] Calling o1-preview in a blocking API call for real-time chat UX
Never use o1/o3 in synchronous user-facing paths. Use GPT-4o-mini with speculative execution, or async reasoning chains, instead. o1-mini takes 5-30s versus GPT-4o's 0.5-2s, which creates a latency cliff.
Journey Context:
Product teams try reasoning models for 'smarter' chat and hit a latency wall. o1-preview has a 'thinking' phase that averages 10-30 seconds for complex prompts, and time-to-first-token often exceeds 5s even for short prompts. This destroys the UX of synchronous interfaces. The fix is one of: (1) use fast instruct models with tool use for reasoning-heavy steps; (2) move reasoning to async background jobs (e.g., draft generation); or (3) use o3-mini, which is faster but still 3-10x slower than GPT-4o. A latency increase of roughly an order of magnitude (0.5-2s versus tens of seconds) makes reasoning models unsuitable for blocking calls where the user waits.
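Option (2) can be sketched roughly as follows. This is a minimal illustration, not a tested production pattern: `fast_model`, `reasoning_model`, `handle_chat_turn`, and the `on_refined` callback are hypothetical names, and the model calls are stubbed with short sleeps rather than real API requests. The idea is that only the fast call sits on the user's critical path, while the reasoning call runs as a background task and delivers its result whenever it finishes.

```python
import asyncio
from typing import Callable


# Hypothetical stand-ins for real model calls. Delays are scaled down so the
# sketch runs quickly; swap in actual client calls in a real system.
async def fast_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a ~0.5-2s instruct model
    return f"quick answer to: {prompt}"


async def reasoning_model(prompt: str) -> str:
    await asyncio.sleep(0.2)  # stands in for a ~5-30s reasoning model
    return f"deep answer to: {prompt}"


async def handle_chat_turn(
    prompt: str, on_refined: Callable[[str], None]
) -> tuple[str, asyncio.Task]:
    """Reply with the fast model now; refine with the slow model later."""

    async def refine() -> None:
        # Runs off the critical path; pushes the result when it is ready.
        on_refined(await reasoning_model(prompt))

    # Start the slow call first so it overlaps with the fast one.
    background = asyncio.create_task(refine())
    reply = await fast_model(prompt)  # the only call the user waits on
    return reply, background


async def demo() -> tuple[str, bool, list[str]]:
    refined: list[str] = []
    reply, background = await handle_chat_turn(
        "why is the sky blue?", refined.append
    )
    still_pending = not background.done()  # slow call has not finished yet
    await background  # in a real app this would complete on its own
    return reply, still_pending, refined


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real chat backend the `on_refined` callback would likely push an update over a websocket or write to a store the client polls; that delivery mechanism is an assumption here and not something the report specifies.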
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:28:07.709562+00:00— report_created — created