Report #86086
[cost_intel] Synchronous chat UX becomes unusable with reasoning model latency
Cap response time at 2s for synchronous UX: use instruct models (GPT-4o, Claude 3.5) for real-time chat, and offload reasoning models to async background jobs.
Journey Context:
Reasoning models (o1-preview, o1-mini) take 5-60 seconds per response because they generate a chain of thought before answering. UX research shows that engagement drops sharply after 2 seconds of latency (Google 'Latency Matters' study), and users perceive delays over 5 seconds as 'broken'. Therefore, never expose o1/o3 directly in synchronous chat interfaces. Instead, use fast instruct models for the live interaction. If deep reasoning is needed, queue it as a background job with a progress indicator, or pre-compute the results.
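One way to picture the split is the minimal sketch below. It assumes an asyncio-based backend; `call_fast_model`, `call_reasoning_model`, and `ChatRouter` are hypothetical names invented for illustration, and the sleeps stand in for real model calls. The live reply is held to the 2-second budget while the reasoning request runs as a background job that the client can poll to drive a progress indicator.

```python
import asyncio
import uuid

SYNC_BUDGET_SECONDS = 2.0  # synchronous budget taken from this report

# Hypothetical stand-ins for real model clients; swap in your own calls.
async def call_fast_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulates a fast instruct model
    return f"fast answer to: {prompt}"

async def call_reasoning_model(prompt: str) -> str:
    await asyncio.sleep(0.2)  # simulates a slow reasoning model (shortened)
    return f"reasoned answer to: {prompt}"

class ChatRouter:
    """Answers live with the fast model; runs reasoning in the background."""

    def __init__(self, fast=call_fast_model, slow=call_reasoning_model,
                 budget=SYNC_BUDGET_SECONDS):
        self.fast = fast
        self.slow = slow
        self.budget = budget
        self.jobs = {}  # job_id -> asyncio.Task; keeps tasks referenced

    async def handle(self, prompt: str, needs_reasoning: bool = False) -> dict:
        job_id = None
        if needs_reasoning:
            # Start the slow call without awaiting it on the request path.
            job_id = uuid.uuid4().hex
            self.jobs[job_id] = asyncio.create_task(self.slow(prompt))
        try:
            # Enforce the latency budget on the user-facing reply.
            reply = await asyncio.wait_for(self.fast(prompt), timeout=self.budget)
        except asyncio.TimeoutError:
            reply = "Still working on that, one moment."
        return {"reply": reply, "job_id": job_id}

    def status(self, job_id: str) -> dict:
        """Poll target for a progress indicator."""
        task = self.jobs[job_id]
        if not task.done():
            return {"state": "running", "result": None}
        if task.exception() is not None:
            return {"state": "failed", "result": None}
        return {"state": "done", "result": task.result()}
```

In a real deployment the in-memory `jobs` dict would likely be replaced by a durable queue, since tasks held in process memory are lost on restart. That choice is outside what this report specifies.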
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:05:13.575586+00:00— report_created — created