Report #60501
[cost\_intel] Latency cliff making reasoning models unusable for synchronous chat UX
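Implement a confidence-based router: stream GPT-4o/Claude 3.5 Sonnet immediately for <200ms TTFT; only fork to o1/o3 if the fast model's confidence is below 0.7 or the user query contains an explicit reasoning keyword \('calculate', 'prove', 'debug'\). Confidence here means mean token probability, i.e. the exponential of the mean logprob; a raw mean logprob is never positive, so it cannot be compared against 0.7 directly. This keeps perceived latency under 1s while, by this report's estimate, capturing about 90% of the reasoning benefit.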
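The escalation rule above can be sketched as a small pure function. This is a minimal sketch, not a tested production router: the names `should_escalate` and `mean_token_probability` are hypothetical, the 0.7 threshold and keyword list are copied from this report, and it assumes the fast model's API returns per-token logprobs, which not every provider exposes. Treating a missing logprob list as low confidence is also an assumption.

```python
import math
import re
from typing import Sequence

# Values taken from the report text; tune them against your own traffic.
CONFIDENCE_THRESHOLD = 0.7
REASONING_KEYWORDS = ("calculate", "prove", "debug")

# Whole-word, case-insensitive match on the explicit reasoning keywords.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(REASONING_KEYWORDS) + r")\b", re.IGNORECASE
)


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean(logprobs)), the geometric-mean token probability."""
    if not token_logprobs:
        # Assumption: no logprobs available means we cannot trust the draft.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_escalate(query: str, token_logprobs: Sequence[float]) -> bool:
    """Decide whether to fork the request to a reasoning model."""
    if _KEYWORD_RE.search(query):
        return True
    return mean_token_probability(token_logprobs) < CONFIDENCE_THRESHOLD
```

The keyword check runs on the query alone, so it can fire before the fast model has produced any tokens; the confidence check needs at least a partial fast-model response, which means the fast stream is already on screen by the time an escalation is decided.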
Journey Context:
Reasoning models \(o1, o3-mini\) have a TTFT \(time to first token\) of 5-30 seconds because they generate a chain of thought before emitting any output. That is unacceptable for chat UX, where users expect a first token in under 500ms. The common mistake is to route every query to a reasoning model, which causes users to abandon the session. The degradation signature is users typing '??' or 'hello?' while waiting. The solution is a routing layer that answers with a fast model first and escalates to a reasoning model only for complex sub-tasks. Provenance: OpenAI's documentation notes o1-mini takes 'several seconds' for complex queries vs milliseconds for GPT-4o. This pattern is documented in latency engineering guides for LLM applications.
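The degradation signature described above can be counted in chat logs with a simple matcher. This is a rough sketch under stated assumptions: `is_impatience_ping` is a hypothetical helper, the pattern covers only the two examples named in this report, and the caller is assumed to know whether a model response was still pending when the message arrived.

```python
import re

# Matches only the two signals named in the report: a bare run of two or more
# question marks, or "hello" followed by one or more question marks.
_IMPATIENCE_RE = re.compile(r"\s*(?:\?{2,}|hello\?+)\s*", re.IGNORECASE)


def is_impatience_ping(message: str, awaiting_response: bool) -> bool:
    """Flag a user message sent while the model had not yet started replying."""
    if not awaiting_response:
        # The same text after a completed reply is not a latency signal.
        return False
    return _IMPATIENCE_RE.fullmatch(message) is not None
```

Tracking the rate of flagged messages per session before and after enabling the router gives a cheap check on whether perceived latency actually improved.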
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:02:26.060469+00:00— report_created — created