Report #98646
[cost_intel] Reasoning-model latency makes these models unusable in synchronous user-facing chat
Keep reasoning models out of live chat and streaming UIs that target response times under 2-3s; route them instead to background jobs, explicit 'deep analysis' buttons, or async pipelines. Use fast instruct models for the main interaction and escalate to a reasoning model only on known-hard subtasks.
Journey Context:
Reasoning models often generate thousands of hidden reasoning tokens before emitting the first visible token, so the user sees nothing while that work happens. OpenAI's reasoning guide steers low-latency applications toward non-reasoning models. Typical latency before the first visible token is roughly 5-30s for a reasoning model, versus <1s for GPT-4o or Claude Haiku. UX research consistently shows abandonment rising sharply once a blank wait exceeds about 3s. The practical architecture is a cascade: the fast model answers immediately; if its confidence is low or the user explicitly asks for reasoning, the request is handed off to the thinking model and a progress indicator is shown while it runs.
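The cascade described above can be sketched roughly as follows. This is a minimal illustration, not a tested production pattern: `fast_model`, `reasoning_model`, the `Draft.confidence` score, and the `CONFIDENCE_FLOOR` threshold are all hypothetical stand-ins, and the keyword check that fakes a confidence value exists only so the example runs without a real model API.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

# Hypothetical threshold; a real system would tune this empirically.
CONFIDENCE_FLOOR = 0.7


@dataclass
class Draft:
    text: str
    confidence: float  # hypothetical self-reported score in [0, 1]


async def fast_model(prompt: str) -> Draft:
    """Stand-in for a low-latency instruct model call (assumed, not a real API)."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    looks_hard = "prove" in prompt.lower()  # toy heuristic for the demo only
    return Draft(text=f"quick answer to: {prompt}",
                 confidence=0.3 if looks_hard else 0.9)


async def reasoning_model(prompt: str) -> str:
    """Stand-in for a slow reasoning model call (assumed, not a real API)."""
    await asyncio.sleep(0)  # in practice this could take many seconds
    return f"deep analysis of: {prompt}"


async def respond(prompt: str,
                  emit: Callable[[str], None],
                  wants_reasoning: bool = False) -> None:
    """Show the fast answer first, then escalate only when needed."""
    draft = await fast_model(prompt)
    emit(draft.text)  # the user sees something right away

    if wants_reasoning or draft.confidence < CONFIDENCE_FLOOR:
        emit("[working on a deeper analysis...]")  # progress indicator
        emit(await reasoning_model(prompt))


if __name__ == "__main__":
    asyncio.run(respond("Prove this lemma", print))
```

The design choice here is that the fast answer is always emitted before any escalation decision is made, so the blank wait never depends on the slow model. In a real deployment the escalated call would more likely be queued as a background job than awaited inline.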
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:19:41.227935+00:00— report_created — created