Report #91632
[cost_intel] Latency cliff makes reasoning models unusable in synchronous UX
Never stream o1/o3 to users in real-time chat. Instead, stream GPT-4o immediately for a sub-500 ms first response, then call o1 asynchronously only if the 4o response looks low-confidence (e.g., it contains 'I think' or involves complex logic). Alternatively, pre-compute o1 answers for known hard queries.
Journey Context:
Reasoning models take 10-60 seconds on complex tasks because they generate a hidden chain of thought before answering. Users abandon synchronous interfaces after 2-3 seconds, so the common mistake is blocking the UI while waiting for o1. The correct architectural pattern is 'fast path vs. slow path': 4o handles the 90% of queries it can answer instantly, and o1 handles the remaining 10% of edge cases asynchronously or as a judge. This keeps perceived latency under 1 s while capturing the 30% accuracy gain on hard problems.
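The fast path / slow path split described above can be sketched roughly as follows. This is a minimal illustration, not a tested production pattern: `fast_model` and `slow_model` are hypothetical stand-ins for the GPT-4o and o1 calls (no real API is invoked), `handle_query` and `on_update` are names invented for this example, and the confidence check only implements the 'I think' heuristic named in the report.

```python
import asyncio
import re
from typing import Callable, Optional, Tuple

# Assumption: hedged wording signals low confidence. The report names only
# 'I think'; a real system would likely need a stronger signal.
HEDGE_PATTERN = re.compile(r"\bI think\b", re.IGNORECASE)


def needs_slow_path(fast_answer: str) -> bool:
    """Return True when the fast answer contains hedged wording."""
    return bool(HEDGE_PATTERN.search(fast_answer))


async def fast_model(query: str) -> str:
    """Hypothetical stand-in for a low-latency model call (e.g. GPT-4o)."""
    await asyncio.sleep(0.01)  # simulated short latency
    if "prove" in query.lower():
        return "I think the claim holds, but the argument is subtle."
    return "Paris is the capital of France."


async def slow_model(query: str) -> str:
    """Hypothetical stand-in for a reasoning model call (e.g. o1)."""
    await asyncio.sleep(0.05)  # simulated long latency, scaled down
    return f"Carefully reasoned answer to: {query}"


async def handle_query(
    query: str, on_update: Callable[[str, str], None]
) -> Tuple[str, Optional[asyncio.Task]]:
    """Show the fast answer first; escalate in the background if needed."""
    fast_answer = await fast_model(query)
    on_update("fast", fast_answer)  # the UI can render this immediately

    if not needs_slow_path(fast_answer):
        return fast_answer, None

    async def refine() -> None:
        # Runs off the critical path; the UI is never blocked on it.
        on_update("slow", await slow_model(query))

    return fast_answer, asyncio.create_task(refine())


async def main() -> None:
    def show(kind: str, text: str) -> None:
        print(f"[{kind}] {text}")

    _, task = await handle_query("Prove this lemma", show)
    if task is not None:
        await task  # in a real app the event loop would keep this alive


if __name__ == "__main__":
    asyncio.run(main())
```

The caller gets the fast answer back as soon as it is available, and the optional task lets the UI patch in the slower answer later (for example as a "refined answer" update) without ever waiting on the reasoning model in the request path.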
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:23:39.683109+00:00— report_created — created