Report #100503
[cost_intel] The reasoning-model latency cliff: when does synchronous UX become unusable?
Avoid reasoning models for synchronous user-facing chat, voice assistants, or any UX requiring sub-5-second responses. OpenAI's reasoning guide recommends `reasoning.effort: none` for 'voice, fast information retrieval, and classification' and notes reasoning models generate 'a few hundred to tens of thousands' of hidden tokens. Use reasoning only when users expect a wait (code review, deep research, async analysis) or hide it behind streaming/async jobs.
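A rough back-of-envelope sketch of why the hidden-token range matters for a sub-5-second budget. The throughput (100 tokens/s) and fixed overhead (0.5 s) are illustrative assumptions, not figures from this report or from any vendor, and the function names are hypothetical; measure your own deployment before relying on the numbers.

```python
SYNC_BUDGET_S = 5.0  # the sub-5-second threshold named in this report


def estimated_wait_seconds(
    hidden_tokens: int,
    tokens_per_second: float,
    base_latency_s: float = 0.5,  # assumed network + queueing overhead
) -> float:
    """Estimate the delay before the first visible token.

    Assumes all hidden reasoning tokens are generated before any
    visible output, at a constant decode throughput.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return base_latency_s + hidden_tokens / tokens_per_second


def fits_sync_budget(
    hidden_tokens: int,
    tokens_per_second: float,
    budget_s: float = SYNC_BUDGET_S,
) -> bool:
    """True if the estimated wait stays within the synchronous budget."""
    return estimated_wait_seconds(hidden_tokens, tokens_per_second) <= budget_s


if __name__ == "__main__":
    # "A few hundred" hidden tokens at an assumed 100 tokens/s: 3.5 s.
    print(estimated_wait_seconds(300, 100.0))
    # "Tens of thousands" hidden tokens at the same rate: 200.5 s.
    print(estimated_wait_seconds(20_000, 100.0))
```

Under these assumptions the low end of the range barely fits a synchronous budget and the high end misses it by more than an order of magnitude, which is why the interaction model has to change rather than the model settings.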
Journey Context:
Reasoning models trade latency for accuracy by design. Time to first token and total generation time can be 5-30x longer than for instruct models, because the model generates a hidden chain of thought before any visible output. A common failure mode is adding reasoning to a chatbot and watching user engagement drop. The fix is not to speed up the model but to change the interaction model: use async processing, show a progress indicator, or route simple queries to fast models. Latency is a product constraint, not a model-tuning problem.
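The interaction-model change described above can be sketched as a small router plus an async wrapper that reports progress while a slow job runs. This is a minimal illustration under stated assumptions: the tier labels `"fast"` and `"reasoning"`, the `Route` type, and both function names are hypothetical, and no real model API is called. How a query gets classified as simple is left to the caller.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Coroutine, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Route:
    model_tier: str  # hypothetical labels: "fast" or "reasoning"
    mode: str        # "sync" (user waits inline) or "async" (background job)


def choose_route(is_simple: bool) -> Route:
    """Send simple queries to a fast model inline; push the rest to a
    background reasoning job so the chat UI never blocks on it."""
    if is_simple:
        return Route("fast", "sync")
    return Route("reasoning", "async")


async def run_with_progress(
    job: Coroutine[Any, Any, T],
    on_progress: Callable[[float], None],
    interval_s: float = 1.0,
) -> T:
    """Await a slow job, calling on_progress(elapsed_seconds) on every
    interval that passes before it finishes."""
    task = asyncio.ensure_future(job)
    elapsed = 0.0
    while True:
        done, _pending = await asyncio.wait({task}, timeout=interval_s)
        if done:
            return task.result()
        elapsed += interval_s
        on_progress(elapsed)


async def _demo() -> None:
    async def slow_reasoning_call() -> str:
        # Stand-in for a long-running reasoning request (assumption).
        await asyncio.sleep(0.3)
        return "analysis complete"

    route = choose_route(is_simple=False)
    if route.mode == "async":
        result = await run_with_progress(
            slow_reasoning_call(),
            lambda s: print(f"still working... {s:.1f}s"),
            interval_s=0.1,
        )
        print(result)


if __name__ == "__main__":
    asyncio.run(_demo())
```

The design choice here is to keep routing a pure function, so it can be tested without a model in the loop, and to treat the progress indicator as a callback, so the same wrapper can drive a spinner, a status message, or a job-status endpoint.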
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:20:21.178773+00:00— report_created — created