Report #71683
[cost_intel] What latency threshold makes reasoning models unusable for synchronous user interfaces?
Avoid o1/o3 for any UX where the user waits on an active cursor or a real-time typing indicator. A P95 latency of 10-15s for reasoning models destroys perceived performance compared with GPT-4o's ~800ms P95. For inline suggestions or chat, use GPT-4o with speculative decoding; reserve reasoning models for async background tasks.
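The P95 figures above are only useful if you can reproduce them against your own traffic. Below is a minimal sketch of how that measurement might look: time the first non-empty chunk of a streamed response, then take a nearest-rank 95th percentile over the samples. The names `time_to_first_chunk`, `p95`, and `fake_stream` are hypothetical, and `fake_stream` is a stand-in for whatever streaming client call you actually use.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(start_stream: Callable[[], Iterable[str]]) -> float:
    """Seconds from issuing the request until the first non-empty chunk arrives."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - started
    raise RuntimeError("stream ended without producing any output")


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def fake_stream(delay_s: float) -> Iterator[str]:
    # Hypothetical stand-in for a real streaming API call (assumption).
    time.sleep(delay_s)
    yield "hello"
    yield " world"


if __name__ == "__main__":
    samples = [time_to_first_chunk(lambda: fake_stream(0.01)) for _ in range(20)]
    print(f"P95 time to first chunk: {p95(samples):.3f}s")
```

Nearest-rank is used here because it always returns a latency that was actually observed; an interpolating percentile would also be reasonable.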
Journey Context:
OpenAI's latency docs show o1-preview P95 at ~12s vs GPT-4o at ~800ms, a 15x gap that pushes responses past the human perception thresholds for 'immediate' (<1s) and 'tolerable wait' (<3s). The UX cliff occurs because reasoning models generate internal chain-of-thought tokens before emitting any visible output, so the whole reasoning phase lands in the time-to-first-token (TTFT). Common error: architects assume 'smarter = better user experience' without measuring TTFT. In practice, a 12-second typing indicator causes user abandonment faster than a slightly dumber instant response. The rule: if the user is staring at the screen waiting, >2s is fatal; if it's background processing (code review, document analysis), 30s is acceptable.
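The rule above can be sketched as a small routing function. This is an illustration under stated assumptions, not a definitive implementation: the P95 values are the approximate figures quoted in this report, the 2s and 30s budgets are the thresholds from the rule, and `pick_model` and the constant names are hypothetical. Replace the latency table with numbers you have measured yourself.

```python
# Approximate P95 latencies quoted in this report (assumption: measure your own).
P95_LATENCY_S = {"gpt-4o": 0.8, "o1-preview": 12.0}

SYNC_BUDGET_S = 2.0    # user is staring at the screen
ASYNC_BUDGET_S = 30.0  # background processing (code review, document analysis)


def pick_model(user_is_waiting: bool, wants_reasoning: bool) -> str:
    """Return the most preferred model whose P95 fits the latency budget."""
    budget = SYNC_BUDGET_S if user_is_waiting else ASYNC_BUDGET_S
    if wants_reasoning:
        preferred = ["o1-preview", "gpt-4o"]
    else:
        preferred = ["gpt-4o", "o1-preview"]
    for model in preferred:
        if P95_LATENCY_S[model] <= budget:
            return model
    raise LookupError(f"no model fits a {budget}s latency budget")


if __name__ == "__main__":
    print(pick_model(user_is_waiting=True, wants_reasoning=True))
    print(pick_model(user_is_waiting=False, wants_reasoning=True))
```

With these numbers, a synchronous request falls back to the fast model even when reasoning is preferred, because 12s exceeds the 2s budget; the same request routed as a background job gets the reasoning model.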
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:53:45.469256+00:00— report_created — created