Report #44690
[cost_intel] Maximum acceptable latency threshold for reasoning models in real-time user interfaces
Do not use o1/o3 inline in chat interfaces that must respond in under 2 s. If reasoning is expected to take more than 10 s, move it to an async workflow (webhook or email delivery). If users can tolerate a 2-10 s wait, show a 'reasoning...' loading state, with explicit user consent.
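The thresholds above can be sketched as a small routing function. This is only an illustration: the names `Strategy` and `choose_strategy` are hypothetical, the 2 s and 10 s cutoffs are taken from this report, and the fallback when the user has not consented to waiting is an assumption rather than something the report specifies.

```python
from enum import Enum


class Strategy(Enum):
    FAST_MODEL = "fast_model"          # answer inline with a non-reasoning model
    LOADING_STATE = "loading_state"    # show a 'reasoning...' indicator and wait
    ASYNC_WORKFLOW = "async_workflow"  # deliver later via webhook or email


def choose_strategy(ui_budget_s: float,
                    expected_reasoning_s: float,
                    user_consented: bool = False) -> Strategy:
    """Pick a UX strategy from the latency thresholds in this report.

    ui_budget_s: how quickly the interface must respond, in seconds.
    expected_reasoning_s: estimated reasoning-model latency, in seconds.
    user_consented: whether the user explicitly agreed to wait.
    """
    # Interfaces that need a reply in under 2 s cannot host a reasoning model inline.
    if ui_budget_s < 2.0:
        return Strategy.FAST_MODEL
    # Reasoning expected to exceed 10 s goes to an async workflow.
    if expected_reasoning_s > 10.0:
        return Strategy.ASYNC_WORKFLOW
    # A 2-10 s wait is acceptable only with explicit consent.
    if user_consented:
        return Strategy.LOADING_STATE
    # Assumption (not from the report): without consent, do not block the UI.
    return Strategy.ASYNC_WORKFLOW
```

For example, `choose_strategy(8.0, 5.0, user_consented=True)` selects the loading state, while the same call with a 30 s reasoning estimate selects the async workflow.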
Journey Context:
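Standard LLMs can stream tokens as soon as generation starts. Reasoning models instead withhold visible output until their internal chain-of-thought (hidden reasoning tokens) completes, which routinely takes 10-60 s (API docs). That is far beyond what interactive users tolerate: the Doherty Threshold puts the target for system response at about 400 ms, and a user's flow of thought is commonly held to break after roughly 1-2 s. Common mistake: dropping o1 into an existing chat UI unchanged, which leads to user abandonment. Alternative: use a fast model for an immediate acknowledgement and run the reasoning model asynchronously for the final answer.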
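The "fast acknowledgement plus async reasoning" alternative might look roughly like the sketch below. Everything in it is a stand-in: `fast_acknowledge` and `slow_reasoning` simulate model calls with short sleeps instead of calling any real API, and `send` is a hypothetical callback that would push a message to the chat UI.

```python
import asyncio


async def fast_acknowledge(question: str) -> str:
    # Stand-in for a low-latency model call (hypothetical; no real API is used).
    await asyncio.sleep(0.01)
    return f"Working on it: {question}"


async def slow_reasoning(question: str) -> str:
    # Stand-in for a reasoning-model call; a real call may take tens of seconds.
    await asyncio.sleep(0.05)
    return f"Final answer for: {question}"


async def handle_message(question: str, send) -> None:
    # Start the slow call first so it runs while the acknowledgement is produced.
    reasoning_task = asyncio.create_task(slow_reasoning(question))
    # The user sees the acknowledgement almost immediately.
    await send(await fast_acknowledge(question))
    # The final answer is delivered whenever the reasoning call finishes.
    await send(await reasoning_task)


def run_demo(question: str) -> list[str]:
    sent: list[str] = []

    async def send(message: str) -> None:
        # In a real UI this would write to a websocket, webhook, or email queue.
        sent.append(message)

    asyncio.run(handle_message(question, send))
    return sent
```

Calling `run_demo("q")` collects the acknowledgement first and the final answer second. In production the second `send` would more likely be a webhook or email delivery than a held-open connection.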
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:28:49.970527+00:00— report_created — created