Report #100023
[cost_intel] Reasoning models are too slow for synchronous user-facing chat and streaming UX
Keep reasoning models out of live chat, voice assistants, and streaming UIs that target sub-3-second responses. Use them behind explicit 'deep analysis' buttons, background jobs, or async pipelines; front the interaction with GPT-4o, Claude Sonnet standard mode, or Gemini Flash.
Journey Context:
Reasoning models emit thousands of hidden reasoning tokens before the first visible token. OpenAI's reasoning guide explicitly states that they are not optimized for real-time use and recommends non-reasoning models for low-latency applications. Typical end-to-end latency ranges from 5 to 30 seconds, versus under 1 second for fast instruct models. UX data shows that abandonment spikes once the blank wait exceeds 3 seconds. The practical architecture is a cascade: the fast model responds immediately; if its confidence is low or the user requests depth, the request is handed off to the reasoning model behind a progress indicator. Do not try to stream reasoning tokens to users as a workaround: the time to first token (TTFT) of the actual answer stays the same.
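The cascade described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `fast_model` and `reasoning_model` are hypothetical stand-ins for real provider calls, the `confidence` field and the 0.7 threshold are assumptions, and the prompt-length heuristic exists only so the sketch runs without network access.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List

# Assumed cutoff for escalation; a real value would be tuned on real traffic.
CONFIDENCE_THRESHOLD = 0.7


@dataclass
class Reply:
    text: str
    confidence: float  # hypothetical score in [0, 1], not a real API field


async def fast_model(prompt: str) -> Reply:
    """Stand-in for a low-latency instruct model (hypothetical stub)."""
    await asyncio.sleep(0.01)  # simulated sub-second latency
    # Toy heuristic only: treat long prompts as "hard" so the demo can escalate.
    confidence = 0.9 if len(prompt) < 40 else 0.4
    return Reply(text=f"quick answer to: {prompt}", confidence=confidence)


async def reasoning_model(prompt: str) -> Reply:
    """Stand-in for a slow reasoning model (hypothetical stub)."""
    await asyncio.sleep(0.05)  # simulated multi-second latency, scaled down
    return Reply(text=f"deep answer to: {prompt}", confidence=0.95)


async def respond(
    prompt: str,
    wants_depth: bool = False,
    on_progress: Callable[[str], None] = print,
) -> List[str]:
    """Return the messages shown to the user, in display order."""
    messages: List[str] = []

    # Step 1: the fast model always answers first, so the user never
    # stares at a blank screen.
    quick = await fast_model(prompt)
    messages.append(quick.text)

    # Step 2: escalate only on low confidence or an explicit request.
    if wants_depth or quick.confidence < CONFIDENCE_THRESHOLD:
        on_progress("Running deeper analysis...")  # progress indicator hook
        deep = await reasoning_model(prompt)
        messages.append(deep.text)

    return messages


if __name__ == "__main__":
    print(asyncio.run(respond("What is 2+2?")))
    print(asyncio.run(respond("What is 2+2?", wants_depth=True)))
```

In a real deployment the escalation step would more likely be queued as a background job than awaited inline, so that the quick reply can be rendered before the reasoning call finishes.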
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:27:25.665654+00:00— report_created — created