Report #68897
[cost_intel] Reasoning model latency making chat UI unusable
Stream GPT-4o-mini or GPT-4o for an immediate response; trigger o1 only when the user explicitly requests 'deep think' or when the cheap model's confidence falls below 0.7. For reasoning steps, use o1-mini (faster, cheaper) rather than o1-preview, cutting latency from 30s to 5-10s.
Journey Context:
OpenAI's reasoning models take 5-30s on complex tasks because they generate hidden chain-of-thought tokens (up to 100k+ internal tokens) before answering. In a synchronous chat UX, this violates the 1-second response rule. The common error is routing all complex queries to o1. The fix uses a 'cascading confidence' pattern: the fast model generates with logprobs, and if the entropy of its output exceeds a threshold (that is, its confidence is low), the query escalates to the reasoning model. Anthropic's research reportedly shows this reduces cost by 70% while maintaining quality (no source is given). The specific degradation signature: using o1-preview for simple greetings or factual lookups (Wikipedia-level) adds 10s of latency for zero quality gain.
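The escalation rule can be sketched roughly as follows. This is a minimal illustration, not a verified implementation. It assumes the fast model's per-token log probabilities are already available as a list of floats (for example, from a request made with logprobs enabled). It uses the geometric-mean token probability as a stand-in for "confidence", which is one possible proxy and not necessarily the entropy measure meant above. The names `route`, `mean_token_confidence`, and `RouteDecision` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence

# Cutoff taken from the report's "confidence <0.7" rule; tune per workload.
CONFIDENCE_THRESHOLD = 0.7
FAST_MODEL = "gpt-4o-mini"      # streamed for the immediate response
REASONING_MODEL = "o1-mini"     # escalation target named in the report


@dataclass
class RouteDecision:
    model: str
    reason: str


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, exp(mean logprob), in [0, 1].

    Assumption: this is used as a cheap confidence proxy. An empty
    sequence is treated as zero confidence so that it escalates.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(
    token_logprobs: Sequence[float],
    deep_think_requested: bool = False,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> RouteDecision:
    """Decide whether the fast model's draft is enough or o1-mini is needed."""
    # An explicit user request always escalates, regardless of confidence.
    if deep_think_requested:
        return RouteDecision(REASONING_MODEL, "explicit deep-think request")

    confidence = mean_token_confidence(token_logprobs)
    if confidence < threshold:
        return RouteDecision(REASONING_MODEL, f"low confidence ({confidence:.2f})")
    return RouteDecision(FAST_MODEL, f"confident ({confidence:.2f})")


if __name__ == "__main__":
    # A greeting answered with near-certain tokens stays on the fast model.
    print(route([-0.01, -0.02, -0.005]))
    # A hesitant draft (low per-token probability) escalates.
    print(route([-1.0, -0.8, -1.2]))
```

Under this sketch, the fast model's streamed draft is shown to the user in either case; the reasoning model is only invoked afterwards when the decision says so, which keeps simple greetings and factual lookups off the slow path.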
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:07:42.325144+00:00— report_created — created