Report #30310
[cost_intel] Why does o1-mini cause 15-second UI freezes in chat applications despite being 'mini'?
Never use reasoning models for synchronous chat turns; stream from fast instruct models and offload reasoning to async background jobs or pre-computed contexts.
Journey Context:
o1-mini takes 5-15 seconds on complex queries because it generates its chain of thought before emitting any tokens. In a chat UI, that delay feels broken: users expect a time-to-first-token under 500 ms. The fix is architectural. Use Haiku/Sonnet for the conversational layer, and if reasoning is needed, run the reasoning model outside the chat turn to generate a 'plan' that is stored in state rather than generated live. This latency cliff makes reasoning models unusable for synchronous UX.
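A minimal sketch of that split, assuming an asyncio-based backend. Every name here (`plan_store`, `slow_reasoning_model`, `fast_model_stream`, `precompute_plan`, `chat_turn`) is hypothetical, and both model calls are local stand-ins rather than real provider APIs. The reasoning delay is scaled down to 0.2 s so the example runs quickly. It illustrates one possible shape of the pattern, not a verified implementation.

```python
import asyncio
import time
from typing import AsyncIterator, Dict, List, Optional, Tuple

# Hypothetical in-memory state; a real app would likely use a database or cache.
plan_store: Dict[str, str] = {}


async def slow_reasoning_model(prompt: str) -> str:
    """Stand-in for a reasoning model: nothing is returned until it finishes."""
    await asyncio.sleep(0.2)  # simulated delay, scaled down from the 5-15 s in the report
    return f"plan for: {prompt}"


async def fast_model_stream(prompt: str, plan: Optional[str]) -> AsyncIterator[str]:
    """Stand-in for a fast instruct model that starts streaming right away."""
    prefix = "[with plan] " if plan else "[no plan yet] "
    for token in (prefix + prompt).split(" "):
        await asyncio.sleep(0.001)  # simulated per-token latency
        yield token


async def precompute_plan(session_id: str, prompt: str) -> None:
    """Background job: run the reasoning model and store its plan in state."""
    plan_store[session_id] = await slow_reasoning_model(prompt)


async def chat_turn(session_id: str, user_msg: str) -> Tuple[List[str], Optional[float]]:
    """Synchronous-feeling chat turn: reads any stored plan, never waits for reasoning."""
    plan = plan_store.get(session_id)
    start = time.perf_counter()
    ttft: Optional[float] = None
    tokens: List[str] = []
    async for token in fast_model_stream(user_msg, plan):
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token for this turn
        tokens.append(token)
    return tokens, ttft


async def demo() -> Tuple[List[str], List[str]]:
    # Start reasoning in the background; the first chat turn does not block on it.
    job = asyncio.create_task(precompute_plan("s1", "refactor the billing module"))
    first, _ = await chat_turn("s1", "hello there")
    await job  # in a real app this would finish on its own between turns
    second, _ = await chat_turn("s1", "what next")
    return first, second


if __name__ == "__main__":
    first_turn, second_turn = asyncio.run(demo())
    print(" ".join(first_turn))
    print(" ".join(second_turn))
```

The first turn streams before any plan exists, and the second turn picks up the stored plan once the background job has completed. The chat path only ever reads from state, so its time-to-first-token depends on the fast model alone.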
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:15:47.139551+00:00— report_created — created