Report #97114
[cost_intel] Calling o1-preview in a real-time chat UI
Use o1 only for async background jobs. For synchronous UX, use GPT-4o with streaming, or o1-mini, which is roughly 3x faster than o1-preview but still about 10x slower than GPT-4o. Offer an 'escalate to deep reasoning' button rather than defaulting to o1.
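One way to express this recommendation is a small routing function. The sketch below is illustrative only: `Route`, `choose_route`, and the `escalate_clicked` flag are hypothetical names, and the model strings are placeholders for whatever identifiers your provider exposes. No API call is made; the function only decides which model to use and how to deliver the result.

```python
from dataclasses import dataclass

# Placeholder model identifiers (assumption); substitute your provider's names.
FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o1-preview"


@dataclass(frozen=True)
class Route:
    model: str
    stream: bool      # stream tokens to the UI as they arrive
    background: bool  # run as an async job instead of blocking the chat turn


def choose_route(escalate_clicked: bool) -> Route:
    """Default to the fast streaming model; use the reasoning model
    only when the user explicitly asks for deep reasoning."""
    if escalate_clicked:
        # Reasoning output arrives as one block, so treat it as a background job.
        return Route(model=REASONING_MODEL, stream=False, background=True)
    return Route(model=FAST_MODEL, stream=True, background=False)
```

A normal chat turn would call `choose_route(False)`; only the handler for the escalation button would pass `True`.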
Journey Context:
o1-preview takes 10-30 seconds on complex reasoning, which breaks the typing illusion in chat. Users abandon after 5 seconds. The latency comes from internal chain-of-thought tokens, which are not streamed (the response arrives as a single block). o1-mini reduces this to 3-10 seconds, which is still unacceptable for real-time use. Reserve reasoning models for 'Generate Report' buttons, not 'Continue conversation'.
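A minimal sketch of the 'Generate Report' pattern, assuming an asyncio-based backend: the handler starts the slow call as a background task and returns a job ID immediately, so the chat turn is never blocked. `slow_reasoning_call` is a stand-in that only sleeps (with the delay shortened for demonstration); `handle_generate_report`, the `jobs` registry, and the job ID format are hypothetical names, not part of any library.

```python
import asyncio


async def slow_reasoning_call(prompt: str) -> str:
    # Stand-in for a reasoning-model request (assumption); the sleep
    # simulates latency and is far shorter than the real 10-30 seconds.
    await asyncio.sleep(0.05)
    return f"report for: {prompt}"


async def handle_generate_report(prompt: str, jobs: dict[str, asyncio.Task]) -> str:
    """Schedule the slow call and return a job ID without waiting for it."""
    job_id = f"job-{len(jobs) + 1}"
    jobs[job_id] = asyncio.create_task(slow_reasoning_call(prompt))
    return job_id


async def demo() -> tuple[str, bool, str]:
    jobs: dict[str, asyncio.Task] = {}
    job_id = await handle_generate_report("Q3 costs", jobs)
    # The acknowledgement is available while the job is still running.
    pending_at_ack = not jobs[job_id].done()
    result = await jobs[job_id]
    return job_id, pending_at_ack, result
```

In a real service the UI would poll or be notified when the task finishes; the in-memory `jobs` dict is only enough for a single-process demonstration.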
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:35:20.477592+00:00— report_created — created