Report #25325
[cost\_intel] Reasoning model latency timeout in synchronous UX
Restrict reasoning models \(o1/o3\) to async background tasks or pre-computation pipelines; use gpt-4o/gpt-4o-mini for any user-facing interaction requiring <500ms time-to-first-token.
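A minimal sketch of the routing rule above, assuming the caller already knows whether the request is synchronous. `Route`, `choose_route`, and the `needs_deep_reasoning` flag are hypothetical names for illustration. Picking gpt-4o over gpt-4o-mini for harder synchronous requests is also an assumption, not part of the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str        # model name to send the request to
    background: bool  # True if the call should run off the user-facing path


def choose_route(synchronous: bool, needs_deep_reasoning: bool) -> Route:
    """Pick a model so that reasoning models never serve synchronous requests."""
    if synchronous:
        # User is waiting on time-to-first-token: non-reasoning models only.
        # Assumption: use the larger gpt-4o when the task is harder.
        model = "gpt-4o" if needs_deep_reasoning else "gpt-4o-mini"
        return Route(model=model, background=False)
    if needs_deep_reasoning:
        # Async or pre-computation work can absorb the multi-second wait.
        return Route(model="o3", background=True)
    # Background work that does not need reasoning stays on the cheap model.
    return Route(model="gpt-4o-mini", background=True)


if __name__ == "__main__":
    print(choose_route(synchronous=True, needs_deep_reasoning=True))
    print(choose_route(synchronous=False, needs_deep_reasoning=True))
```

Putting the synchronous check first means a hard task cannot pull a reasoning model onto the interactive path by accident.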
Journey Context:
Reasoning models generate hidden chain-of-thought tokens, often thousands of them, before emitting the first output token. The resulting 5-30s wait for the first token is architectural: streaming cannot hide it, because there is nothing to stream until reasoning finishes. A common failure is using o1 for live code autocomplete or chat suggestions, which makes the UI hang. Latency scales with reasoning\_effort and problem complexity, not input length. Alternative: for medium-complexity tasks, use o3-mini with reasoning\_effort set to 'low', trading some accuracy for a tolerable 1-3s latency. Even then, never put reasoning models in the critical path of synchronous user interactions.
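One way to keep the slow call off the critical path is sketched below with asyncio. This is an illustration under stated assumptions, not a definitive implementation: `fast_reply` and `deep_analysis` are hypothetical stubs that stand in for a low-latency model call and a reasoning-model call, and the sleeps only imitate their relative latency.

```python
import asyncio


async def fast_reply(prompt: str) -> str:
    # Hypothetical stub for a low-latency model call (e.g. gpt-4o-mini).
    await asyncio.sleep(0.01)
    return f"quick answer to: {prompt}"


async def deep_analysis(prompt: str) -> str:
    # Hypothetical stub for a reasoning-model call; the sleep imitates
    # the long wait before the first output token.
    await asyncio.sleep(0.2)
    return f"detailed analysis of: {prompt}"


async def handle_request(prompt: str) -> tuple[str, asyncio.Task]:
    # Schedule the slow job as a background task, then answer the user
    # from the fast path without waiting for it.
    background = asyncio.create_task(deep_analysis(prompt))
    reply = await fast_reply(prompt)
    return reply, background


async def main() -> list[str]:
    events = []
    reply, background = await handle_request("refactor this function")
    events.append(reply)             # shown to the user right away
    events.append(await background)  # delivered later, e.g. as a follow-up
    return events


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

The user-facing reply returns while the background task is still pending. A real service would deliver the later result through whatever follow-up channel the product has.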
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:54:45.206040+00:00— report_created — created