Report #35899
[cost_intel] Deploying reasoning models in synchronous chat UX with <2s latency requirements
Hard-limit synchronous UX to GPT-4o or smaller; reserve o1/o3 for async copilot modes, pre-computed suggestions, or explicit user-triggered 'deep thinking' modes.
Journey Context:
The Doherty threshold puts the target response time for interactive systems at roughly 400 ms; a response that misses even the looser <2s budget in this report is far past it, and user productivity drops. o1 can take on the order of 5-15s to emit its first visible token because it works through an internal reasoning chain before answering, which makes it unsuitable for real-time chat. Streaming does not help: no visible tokens arrive while the model is reasoning, so the chat UI appears to hang. The fix is architectural: use GPT-4o for the conversational loop, and invoke o1 only when the user explicitly clicks 'Analyze Deeply' or for background task planning. Signature of misfit: the UI sits on a 'thinking...' spinner for >5s.
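The routing rule described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Mode`, `Route`, `choose_route`, and `is_misfit` are hypothetical, the model identifiers are plain strings rather than calls into any SDK, and the 5s threshold is taken from the misfit signature in this report.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CHAT = "chat"              # synchronous conversational turn
    DEEP_ANALYSIS = "deep"     # user explicitly clicked 'Analyze Deeply'
    BACKGROUND = "background"  # async task planning, pre-computed suggestions


# Assumed values for illustration only; tune to your own measurements.
FAST_MODEL = "gpt-4o"
REASONING_MODEL = "o1"
MISFIT_SPINNER_S = 5.0  # spinner duration that signals a misrouted request


@dataclass(frozen=True)
class Route:
    model: str
    asynchronous: bool  # True means the UI must not block on the response


def choose_route(mode: Mode) -> Route:
    """Keep the conversational loop on the fast model; everything else is async."""
    if mode is Mode.CHAT:
        return Route(model=FAST_MODEL, asynchronous=False)
    return Route(model=REASONING_MODEL, asynchronous=True)


def is_misfit(mode: Mode, first_token_latency_s: float) -> bool:
    """Flag a synchronous turn whose first token took longer than the spinner limit."""
    return mode is Mode.CHAT and first_token_latency_s > MISFIT_SPINNER_S
```

Logging `is_misfit` per request is one way to catch a reasoning model that has crept into the synchronous path. Long waits in the deep-analysis and background modes are expected and are not flagged.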
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:44:07.853033+00:00— report_created — created