Report #66570
[cost_intel] Latency cliff making o1 unusable in synchronous chat UX
Restrict reasoning models to async batch jobs; enforce GPT-4o for <500ms UX interactions to avoid 30-120s timeouts
Journey Context:
o1-preview averages 45s per response (p95: 120s) versus 800ms for GPT-4o. In a synchronous HTTP request, that latency triggers gateway timeouts and user abandonment. The latency cliff is binary: reasoning models cannot stream partial thoughts effectively, so each call is a blocking operation. Pattern: answer with GPT-4o first, then run o1 in the background for 'deep analysis' and deliver its result later over a WebSocket. Cost is irrelevant if the UX is broken; 100% of users abandon after 10s.
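A minimal sketch of the fast-then-deep pattern described above, using plain asyncio. Everything named here is an assumption for illustration: `respond`, `fast_model`, `slow_model`, and `push` are hypothetical, the model calls are stubbed with short sleeps instead of real API requests, and `push` stands in for whatever WebSocket send the application uses. The 10s default timeout mirrors the abandonment threshold mentioned in this report.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical signatures: a model call takes a prompt and returns text;
# a push function delivers a late result to the client (e.g. over a WebSocket).
ModelFn = Callable[[str], Awaitable[str]]
PushFn = Callable[[str], Awaitable[None]]


async def _deep_analysis(prompt: str, slow_model: ModelFn, push: PushFn) -> None:
    """Run the slow reasoning model, then push its result to the client."""
    result = await slow_model(prompt)
    await push(result)


async def respond(
    prompt: str,
    fast_model: ModelFn,
    slow_model: ModelFn,
    push: PushFn,
    fast_timeout_s: float = 10.0,
) -> Tuple[str, "asyncio.Task[None]"]:
    """Return the fast answer now; the slow answer arrives later via `push`."""
    # Start the slow call in the background; the request path never awaits it.
    deep_task = asyncio.create_task(_deep_analysis(prompt, slow_model, push))
    # Only the fast model sits on the synchronous path, bounded by a timeout.
    quick = await asyncio.wait_for(fast_model(prompt), timeout=fast_timeout_s)
    # Hand the task back so the caller keeps a reference to it.
    return quick, deep_task


async def demo() -> Tuple[str, List[str], List[str]]:
    pushed: List[str] = []

    async def fake_fast(p: str) -> str:   # stand-in for a GPT-4o call
        await asyncio.sleep(0.01)
        return f"quick: {p}"

    async def fake_slow(p: str) -> str:   # stand-in for an o1 call
        await asyncio.sleep(0.05)
        return f"deep: {p}"

    async def fake_push(msg: str) -> None:  # stand-in for a WebSocket send
        pushed.append(msg)

    quick, task = await respond("q", fake_fast, fake_slow, fake_push)
    before = list(pushed)  # nothing has been pushed when the fast reply returns
    await task             # in a server this would keep running after the reply
    return quick, before, pushed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice is that the HTTP handler's latency depends only on the fast model; the reasoning call can take its full 30-120s without holding a request open. A real deployment would also need error handling for the background task and a way to tie the pushed result to the right client session, neither of which is shown here.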
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:12:56.596677+00:00— report_created — created