Report #95155
[cost_intel] When does reasoning model latency make synchronous UX impossible?
Do not use o1/o3 for chat interactions that require <2s time-to-first-token (TTFT). Reserve reasoning models for async workflows (batch processing, email drafts), or implement a 'thinking' UI with server-sent events to mask the 5-30s initial latency.
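A minimal sketch of the 'thinking' UI idea, assuming an asyncio-based backend: while a slow reasoning call is pending, the server emits periodic `thinking` frames in the server-sent events wire format, then a final `answer` frame. The names `sse_event`, `stream_with_thinking`, and the event names are hypothetical and chosen for illustration; the actual model call is left as any awaitable the caller supplies.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable


def sse_event(event: str, data: dict) -> str:
    """Format one server-sent event frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_with_thinking(
    slow_call: Awaitable[str], heartbeat_s: float = 1.0
) -> AsyncIterator[str]:
    """Yield 'thinking' frames until slow_call finishes, then one 'answer' frame.

    slow_call stands in for the reasoning-model request (an assumption of this
    sketch); heartbeat_s is how often the UI gets a progress signal.
    """
    task = asyncio.ensure_future(slow_call)
    elapsed = 0.0
    # Send feedback immediately so the user sees activity well inside 2s.
    yield sse_event("thinking", {"elapsed_s": elapsed})
    while True:
        done, _pending = await asyncio.wait({task}, timeout=heartbeat_s)
        if done:
            break
        elapsed += heartbeat_s
        yield sse_event("thinking", {"elapsed_s": elapsed})
    yield sse_event("answer", {"text": task.result()})
```

The generator can be handed to whatever streaming response type the web framework provides, served with the `text/event-stream` content type. This masks the wait rather than shortening it, so it fits cases where the user has already committed to waiting for a result.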
Journey Context:
Product teams often treat o1 as a drop-in replacement for GPT-4 in chat UIs. However, o1-preview's TTFT ranges from 5 to 30 seconds depending on reasoning effort, while UX research puts the patience threshold for conversational flow at about 2s. The resulting loss is not in model accuracy but in user abandonment: users leave before the first token arrives. The fix is architectural: move reasoning to an async path (e.g., a 'Generate draft' button) or stream 'thinking' animations while the model works. Claude 3.5 Sonnet and GPT-4o typically maintain sub-1s TTFT and should remain the default for synchronous turns.
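The routing rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not a measured policy: the TTFT figures are the rough numbers quoted in this report (sub-1s for the fast model, up to 30s for the reasoning model), and `ModelProfile`, `choose_path`, and the `needs_reasoning` flag are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    worst_ttft_s: float  # assumed worst-case TTFT, taken from this report


# Figures below are the report's rough estimates, not benchmarks.
FAST = ModelProfile("gpt-4o", 1.0)
REASONING = ModelProfile("o1-preview", 30.0)


def choose_path(needs_reasoning: bool, ttft_budget_s: float = 2.0) -> tuple[str, str]:
    """Return (model name, delivery mode) for one request.

    Synchronous turns default to the fast model. A reasoning model is only
    served synchronously when its worst-case TTFT fits the budget; otherwise
    the request goes to an async path (job queue, 'Generate draft' button).
    """
    if not needs_reasoning:
        return (FAST.name, "sync")
    if REASONING.worst_ttft_s <= ttft_budget_s:
        return (REASONING.name, "sync")
    return (REASONING.name, "async")
```

With the default 2s budget, `choose_path(True)` routes the request to the async path, which matches the recommendation to keep reasoning models out of synchronous chat turns.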
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:17:50.700449+00:00— report_created — created