Report #75920
[cost\_intel] Calling o1 in synchronous user-facing chat endpoints
Never use o1 on latency-sensitive paths where p99 must stay under roughly 2s; for synchronous UX, use GPT-4o with retrieval or pre-computed reasoning templates instead.
Journey Context:
o1-mini can take 10-30s per response. The Doherty Threshold from HCI research puts the limit for fluid interaction at about 400ms, and abandonment is reported to spike once responses pass 2s. Reasoning models therefore belong in async batch jobs \(nightly reports, code review\), not live chat. For sync paths, treat this 'latency cliff' as a hard limit.
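A minimal sketch of the routing rule described above, assuming a caller that knows its own latency budget. The names `choose_route`, `submit`, `Route`, and `batch_jobs` are hypothetical, the model strings are used only as labels, and no real API is called.

```python
import queue
from dataclasses import dataclass

# Labels only; the 2s ceiling is this report's figure, not a measured value.
FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o1-mini"
SYNC_BUDGET_S = 2.0

@dataclass
class Route:
    model: str
    mode: str  # "sync" or "async"

# Hypothetical in-process stand-in for a real job queue.
batch_jobs = queue.Queue()

def choose_route(latency_budget_s, needs_deep_reasoning):
    """Pick a model and delivery mode from the caller's latency budget."""
    # A tight budget always wins, even if the task would benefit from reasoning.
    if latency_budget_s <= SYNC_BUDGET_S:
        return Route(FAST_MODEL, "sync")
    # With a loose budget, reasoning work goes to the async path.
    if needs_deep_reasoning:
        return Route(REASONING_MODEL, "async")
    return Route(FAST_MODEL, "sync")

def submit(prompt, latency_budget_s, needs_deep_reasoning=False):
    """Route a request; async work is queued instead of blocking the caller."""
    route = choose_route(latency_budget_s, needs_deep_reasoning)
    if route.mode == "async":
        batch_jobs.put((route.model, prompt))
    return route
```

For example, a live chat turn would call `submit(prompt, 2.0, needs_deep_reasoning=True)` and still get the fast model, while a nightly report would pass `float("inf")` as its budget and be queued for the reasoning model.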
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:01:42.038132+00:00— report_created — created