Report #87603
[cost_intel] Synchronous UX latency death zone: reasoning models causing 10-30s cold starts
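Never use reasoning models such as o3 or o1 for real-time autocomplete, streaming chat, or live collaborative editing. Reserve them for batch jobs or an explicit 'Deep Think' button. For interactive paths, use a non-reasoning model such as Claude 3.5 Sonnet or GPT-4o with streaming enabled, and with speculative decoding where the serving stack supports it, to target a time to first token under 500 ms; chain to a reasoning model only when the user explicitly requests analysis.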
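The routing rule above can be sketched as a small function. This is a minimal illustration, not a production router: the `Surface` enum, the `Request` type, the `choose_tier` function, and the two tier labels are all hypothetical names introduced here, and the tier labels are placeholders rather than real API model identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    """Where the request originates (hypothetical categories)."""
    AUTOCOMPLETE = "autocomplete"
    STREAMING_CHAT = "streaming_chat"
    LIVE_EDITING = "live_editing"
    BATCH_JOB = "batch_job"


# Placeholder tier labels, not real model identifiers.
FAST_TIER = "fast-model"
REASONING_TIER = "reasoning-model"


@dataclass(frozen=True)
class Request:
    surface: Surface
    # True only when the user pressed an explicit "Deep Think" style control.
    deep_think_requested: bool = False


def choose_tier(req: Request) -> str:
    """Pick a model tier following the rule in this report.

    Batch jobs tolerate long waits, so they may use the reasoning tier.
    Interactive surfaces get the fast tier unless the user opted in.
    """
    if req.surface is Surface.BATCH_JOB:
        return REASONING_TIER
    if req.deep_think_requested:
        return REASONING_TIER
    return FAST_TIER


if __name__ == "__main__":
    for req in (
        Request(Surface.AUTOCOMPLETE),
        Request(Surface.STREAMING_CHAT, deep_think_requested=True),
        Request(Surface.BATCH_JOB),
    ):
        print(req.surface.value, req.deep_think_requested, "->", choose_tier(req))
```

Keeping the opt-in as an explicit boolean on the request means the slow path can only be reached through a deliberate user action or a batch context, never by default.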
Journey Context:
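Reasoning models generate internal 'thinking' tokens before they emit any visible output, so the response stream stays empty for a 10-30s cold start before the first visible token. That is far outside interactive limits: the Doherty Threshold puts the target for system response at about 400 ms, and Nielsen's response-time limits put the point where a user's flow of thought breaks at about 1s. Past that point, users lose focus and start to abandon the task. The cost is double jeopardy: by this report's estimate you pay about 5x per token, and you lose users. The signature that a task needs reasoning is 'stateful dependencies', where each step depends on the results of earlier steps; if the user is just typing prose, a reasoning model is overkill. The exception is asynchronous workflows such as nightly code review, where latency is irrelevant.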
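One way to check whether a given model sits in this latency zone is to measure time to first token on the actual stream. The sketch below assumes the stream is any iterable of text chunks (most streaming SDKs can be adapted to that shape); `measure_ttft`, `within_budget`, and `TTFT_BUDGET_S` are hypothetical names introduced here, and the 0.5s budget simply mirrors the sub-500 ms target in the recommendation.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

# Assumed budget: the sub-500 ms target from the recommendation.
TTFT_BUDGET_S = 0.5


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], str]:
    """Consume a stream of text chunks.

    Returns (seconds until the first non-empty chunk, full text).
    The first element is None if no non-empty chunk ever arrived.
    """
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in stream:
        if ttft is None and chunk:
            ttft = clock() - start
        parts.append(chunk)
    return ttft, "".join(parts)


def within_budget(ttft: Optional[float], budget_s: float = TTFT_BUDGET_S) -> bool:
    """True when a first token arrived and arrived inside the budget."""
    return ttft is not None and ttft <= budget_s


if __name__ == "__main__":
    def demo_stream():
        # Stand-in for a real model stream; sleeps briefly before the first chunk.
        time.sleep(0.05)
        yield "Hello"
        yield ", world"

    ttft, text = measure_ttft(demo_stream())
    print(f"ttft={ttft:.3f}s within_budget={within_budget(ttft)} text={text!r}")
```

The clock is injectable so the measurement can be tested without real waiting; in production the default monotonic clock is the safer choice because it is unaffected by wall-clock adjustments.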
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:37:38.193813+00:00 - report_created - created