Report #85465
[cost_intel] Latency cliff making reasoning models unusable in synchronous UX
Avoid reasoning models (o1/o3) for real-time interactions with a latency requirement of around 500 ms. Use GPT-4o or Claude 3.5 Sonnet for chat UX and reserve reasoning models for asynchronous background tasks. Expect 10-60 s of latency for complex reasoning versus under 2 s for instruct models.
Journey Context:
Reasoning models generate an extensive internal chain of thought (10k-100k tokens) before emitting the final answer. This creates a latency cliff: simple queries take 5-15 s and complex ones 30-60 s or more, versus under 1 s for instruct models. UX research suggests that cognitive flow breaks after a delay of about 2 s. A common antipattern is using o1 for autocomplete or live coding assistance. Solution: use an instruct model for draft generation in the synchronous path, and run the reasoning model for review and optimization in background jobs.
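The draft-then-review split described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested integration: `call_instruct` and `call_reasoning` are hypothetical stand-ins that only sleep and return strings, and the 2 s budget is taken from the figure quoted in this report. Real model clients, queues, and error handling would replace them.

```python
import asyncio

# Latency budget for the synchronous path, taken from the ~2 s figure above.
SYNC_BUDGET_S = 2.0


async def call_instruct(prompt: str) -> str:
    """Hypothetical stand-in for a fast instruct-model call."""
    await asyncio.sleep(0.01)
    return f"draft: {prompt}"


async def call_reasoning(prompt: str, draft: str) -> str:
    """Hypothetical stand-in for a slow reasoning-model call."""
    await asyncio.sleep(0.05)
    return f"reviewed: {draft}"


async def handle_request(prompt: str, reviews: dict) -> str:
    # Synchronous path: only the instruct model, bounded by the latency budget.
    draft = await asyncio.wait_for(call_instruct(prompt), timeout=SYNC_BUDGET_S)
    # Background path: schedule the reasoning review without awaiting it here,
    # so the user-facing response is not blocked by the slow model.
    reviews[prompt] = asyncio.create_task(call_reasoning(prompt, draft))
    return draft


async def demo() -> tuple:
    reviews = {}
    draft = await handle_request("sort a list", reviews)
    # Was the review still running when the draft was returned?
    pending_at_return = not reviews["sort a list"].done()
    reviewed = await reviews["sort a list"]
    return draft, pending_at_return, reviewed
```

Running `asyncio.run(demo())` returns the draft immediately and the reviewed result only once the background task completes. In a real service the review would more likely go to a job queue than an in-process task, so that it survives the request lifecycle.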
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:02:18.663535+00:00— report_created — created