Report #84973
[cost_intel] What is the exact latency threshold where reasoning models become unusable for real-time UX?
Never use reasoning models for a user-facing streaming UI whose latency budget is under 1.5s TTFB (time-to-first-byte). Reasoning models have a 3-10s TTFB because they run an internal chain of thought before emitting any output. Reserve them for async background jobs or an explicit 'deep think' button.
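The rule above can be sketched as a small routing function. This is a minimal illustration, not a definitive policy: the `Request` fields, the tier names `"fast"` and `"reasoning"`, and the choice to compare the budget against the worst-case end of the reported 3-10s range are all assumptions made for the example. Only the 3-10s figure comes from the report.

```python
from dataclasses import dataclass

# TTFB range for reasoning models as stated in the report (seconds).
REASONING_TTFB_RANGE_S = (3.0, 10.0)


@dataclass(frozen=True)
class Request:
    """Hypothetical request descriptor; field names are illustrative."""
    ttfb_budget_s: float                # how long the UI can wait for first output
    is_background_job: bool = False     # async work, nobody is watching a spinner
    deep_think_requested: bool = False  # user pressed an explicit 'deep think' button


def choose_tier(req: Request) -> str:
    """Return "reasoning" or "fast" for a request."""
    # Background jobs and explicit opt-ins are the cases the report reserves
    # for reasoning models.
    if req.is_background_job or req.deep_think_requested:
        return "reasoning"
    # Assumption: an interactive request may use a reasoning model only if
    # its budget covers the worst-case reported TTFB. A 1.5s budget never does.
    if req.ttfb_budget_s >= REASONING_TTFB_RANGE_S[1]:
        return "reasoning"
    return "fast"
```

With these assumptions, `choose_tier(Request(ttfb_budget_s=1.5))` routes to the fast tier, while the same budget with `deep_think_requested=True` routes to the reasoning tier.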
Journey Context:
Product teams try to stream o1-mini in chat apps. The hard constraint is that reasoning models run their chain of thought internally before emitting the first token, which makes TTFB 5-30x slower than GPT-4o. Vercel AI SDK telemetry shows a 90th-percentile TTFB of 8.2s for o1-preview versus 280ms for GPT-4o-mini, roughly a 29x gap. The UX degradation is binary: users perceive under 1s as instant and over 3s as broken, so a 3-10s TTFB falls at or beyond the point users read as broken. Attempting to 'stream reasoning tokens' fails because reasoning models emit their thought process in non-streaming blocks. The workaround is 'progressive disclosure': a cheap model gives an instant draft and a reasoning model polishes it in the background. This requires careful state management to avoid jarring content shifts.
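A minimal sketch of the progressive-disclosure workaround, assuming an asyncio-based backend. The two model calls are stand-in stubs (`fast_draft` and `reasoning_polish` are hypothetical names that only sleep and return strings), and the `request_id` check is one possible way to handle the state-management concern: a late polish is dropped if the user has already sent a newer prompt.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class MessageState:
    """UI-facing state for the latest answer (hypothetical shape)."""
    request_id: int = 0
    text: str = ""
    status: str = "empty"  # "empty" -> "draft" -> "polished"


async def fast_draft(prompt: str) -> str:
    # Stub for a low-latency model call; replace with a real client.
    await asyncio.sleep(0.01)
    return f"draft: {prompt}"


async def reasoning_polish(prompt: str, draft: str) -> str:
    # Stub for a slow reasoning-model call; replace with a real client.
    await asyncio.sleep(0.05)
    return f"polished: {prompt}"


async def answer(state: MessageState, prompt: str) -> asyncio.Task:
    """Show a draft immediately, then polish it in the background."""
    state.request_id += 1
    my_id = state.request_id

    draft = await fast_draft(prompt)
    state.text = draft
    state.status = "draft"

    async def polish() -> None:
        improved = await reasoning_polish(prompt, draft)
        # Only swap content if no newer request has replaced this one;
        # otherwise the stale polish would overwrite the newer answer.
        if state.request_id == my_id:
            state.text = improved
            state.status = "polished"

    # Return the task so the caller keeps a reference and can await or cancel it.
    return asyncio.create_task(polish())
```

A caller would `await answer(state, prompt)` to render the draft, then re-render when the returned task completes. A real UI would likely also mark the swap visually (for example, a 'refined' badge) rather than silently replacing text; that part is left out here.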
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:12:52.159370+00:00— report_created — created