Report #84709
[cost\_intel] Synchronous chat UX latency budget exceeded by reasoning models
Hard cutoff: 800 ms to the first streamed token. If a reasoning model's time to first token exceeds 800 ms \(typical o3-mini: 2-5 s; o1: 10-30 s\), do not use it in a synchronous UI. Instead, either use GPT-4o for a real-time draft \+ a background reasoning check, or switch to an async 'generate full response' mode with loading indicators.
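The cutoff rule can be sketched as a small routing check. This is only an illustration under stated assumptions: `ModelProfile`, `choose_mode`, and the mode strings are hypothetical names, not part of any SDK. The reasoning-model latencies are the lower bounds quoted in this report, and the fast-model figure is a placeholder to be replaced with your own measurement.

```python
from dataclasses import dataclass

# Hard cutoff from the report: 800 ms to the first streamed token.
FIRST_TOKEN_BUDGET_MS = 800.0


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical record of a model's measured time to first content token."""
    name: str
    first_token_ms: float


def choose_mode(profile: ModelProfile,
                budget_ms: float = FIRST_TOKEN_BUDGET_MS) -> str:
    """Return 'sync_stream' if the model fits the budget, else 'async_full_response'."""
    if profile.first_token_ms <= budget_ms:
        return "sync_stream"
    # Over budget: do not stream synchronously; use the async mode
    # with a loading indicator instead.
    return "async_full_response"


# Lower bounds of the ranges quoted in the report (2-5 s and 10-30 s).
O3_MINI = ModelProfile("o3-mini", 2000.0)
O1 = ModelProfile("o1", 10000.0)
# Placeholder value, NOT a measurement: substitute your own number.
FAST_MODEL = ModelProfile("fast-model", 400.0)

for p in (FAST_MODEL, O3_MINI, O1):
    print(p.name, "->", choose_mode(p))
```

In practice the input would likely be a measured percentile (for example p95) rather than a single typical value, since the budget is a hard cutoff.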
Journey Context:
Human-computer interaction research shows that a 1 s delay breaks flow state and causes user abandonment. o3-mini streams reasoning tokens, but content tokens still do not begin for 2-5 s. In A/B tests, user satisfaction drops 40% when the first token takes >1 s, even if final quality is higher. The workaround is speculative execution: generate with the fast model, verify with the slow model in parallel, and swap the response if a mismatch is detected \(adds only 10% latency overhead if cached\).
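A minimal sketch of the speculative-execution workaround, using `asyncio`. Everything here is an assumption made for illustration: `speculative_reply`, the `fast_model` and `slow_model` callables, the `show` callback, and the default whitespace-insensitive mismatch check are hypothetical. A real system would call actual model clients and probably use a semantic comparison. Caching is not modelled.

```python
import asyncio
from typing import Awaitable, Callable

ModelFn = Callable[[str], Awaitable[str]]  # hypothetical async model call


async def speculative_reply(
    prompt: str,
    fast_model: ModelFn,
    slow_model: ModelFn,
    show: Callable[[str], None],
    mismatch: Callable[[str, str], bool] = lambda d, c: d.strip() != c.strip(),
) -> str:
    """Show the fast draft at once; swap in the slow answer on mismatch."""
    # Start the slow model first so its latency overlaps with the draft.
    slow_task = asyncio.create_task(slow_model(prompt))
    try:
        draft = await fast_model(prompt)
    except Exception:
        slow_task.cancel()  # do not leave the background check running
        raise
    show(draft)  # the user sees the draft within the fast model's latency

    checked = await slow_task
    if mismatch(draft, checked):
        show(checked)  # replace the draft with the checked response
        return checked
    return draft
```

The caller supplies `show`, so the same function can drive a UI update, a websocket push, or a test double. The naive string comparison swaps on nearly any wording difference, so it is only a stand-in for a real mismatch detector.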
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:46:12.665200+00:00— report_created — created