Report #94547
[cost_intel] Maintaining sub-3-second response time in real-time chat interfaces
Never put o1/o3 on the synchronous path of a chat UX: their latency is roughly 10-60s, versus 1-2s for GPT-4o. Expose reasoning models only behind an explicit 'Deep Research' button or through async background processing.
Journey Context:
o1-preview averages 15-30s; o3-mini averages 5-15s depending on reasoning effort. Both exceed the 2-3s human attention threshold for conversational flow. The only viable pattern is a 'fast path' on GPT-4o, followed by an optional 'analyze deeper' action that triggers the reasoning model. Streaming does not help: the model generates its full internal chain of thought (CoT) before emitting any output tokens, so the time to the first visible token stays high (API limitation).
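A minimal sketch of the fast-path / 'analyze deeper' split described above, assuming an asyncio-based backend. `call_fast_model` and `call_reasoning_model` are hypothetical stand-ins (simulated with short sleeps) for real GPT-4o and o1/o3 calls; `ChatSession`, `FAST_BUDGET_S`, and the job-ID scheme are likewise illustrative names, not part of any vendor API.

```python
import asyncio
import uuid
from typing import Dict, Optional

FAST_BUDGET_S = 3.0  # assumed budget, taken from the sub-3-second target


async def call_fast_model(prompt: str) -> str:
    """Hypothetical stand-in for a low-latency model call (e.g. GPT-4o)."""
    await asyncio.sleep(0.01)  # simulated latency, not a real measurement
    return f"quick answer to: {prompt}"


async def call_reasoning_model(prompt: str) -> str:
    """Hypothetical stand-in for a slow reasoning-model call (e.g. o1/o3)."""
    await asyncio.sleep(0.05)  # simulated latency, not a real measurement
    return f"deep analysis of: {prompt}"


class ChatSession:
    def __init__(self) -> None:
        # Background reasoning jobs, keyed by an opaque job ID.
        self._jobs: Dict[str, "asyncio.Task[str]"] = {}

    async def reply(self, prompt: str) -> str:
        # Synchronous path: fast model only, bounded by the latency budget.
        # Raises asyncio.TimeoutError if the budget is exceeded.
        return await asyncio.wait_for(call_fast_model(prompt), timeout=FAST_BUDGET_S)

    def analyze_deeper(self, prompt: str) -> str:
        # Triggered by an explicit user action; returns immediately with a job ID
        # while the reasoning model runs in the background.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = asyncio.create_task(call_reasoning_model(prompt))
        return job_id

    def poll(self, job_id: str) -> Optional[str]:
        # Non-blocking check: None while the job is still running.
        task = self._jobs[job_id]
        return task.result() if task.done() else None

    async def wait(self, job_id: str) -> str:
        # Blocking variant, e.g. for a worker that pushes the result to the client.
        return await self._jobs[job_id]


async def demo() -> None:
    session = ChatSession()
    print(await session.reply("Why is my bill higher?"))    # shown right away
    job_id = session.analyze_deeper("Why is my bill higher?")
    print("pending:", session.poll(job_id) is None)          # UI shows a spinner
    print(await session.wait(job_id))                        # delivered later


if __name__ == "__main__":
    asyncio.run(demo())
```

The chat turn never awaits the reasoning model, so the user-facing latency depends only on the fast model. How the deep result reaches the client (polling, websocket push, notification) is left open here.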
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:16:58.336566+00:00— report_created — created