Report #61436
[cost_intel] Latency threshold where reasoning models break synchronous chat UI
Avoid o1/o3 for streaming UIs that require <2s time-to-first-token (TTFT); use GPT-4o (<300ms) or implement an async 'thinking' mode with loading indicators
Journey Context:
Reasoning models take 5-30s for complex reasoning (o1 can take 60s+). In a synchronous chat UI, a silent wait of that length feels broken unless a 'thinking...' indicator is shown. The cliff is 2 seconds: beyond this, users perceive lag as an error. Pattern: chain a cheap model for the draft, then run the reasoning model in the background for verification. Don't block the UI on reasoning unless the mode is explicitly labeled as 'deep research'.
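The draft-then-verify pattern above can be sketched roughly as follows. This is a minimal illustration, not a tested integration: `fast_draft` and `slow_verify` are hypothetical stand-ins for a low-latency model call and a reasoning model call, `emit` is an assumed callback that pushes events to the chat UI, and the 2-second budget is the threshold quoted in this report.

```python
import asyncio

TTFT_BUDGET_S = 2.0  # the 2-second cliff quoted in this report


async def fast_draft(prompt: str) -> str:
    """Hypothetical stand-in for a low-latency model call."""
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_verify(prompt: str, draft: str) -> str:
    """Hypothetical stand-in for a slow reasoning model call."""
    await asyncio.sleep(0.05)
    return f"verified: {draft}"


async def respond(prompt, emit, draft_fn=fast_draft, verify_fn=slow_verify,
                  budget_s=TTFT_BUDGET_S):
    """Show a draft quickly, then upgrade it once verification finishes.

    emit(kind, text) is an assumed UI callback; kind is one of
    "status", "draft", or "final".
    """
    draft_task = asyncio.create_task(draft_fn(prompt))
    # Wait for the draft, but only up to the latency budget.
    done, _pending = await asyncio.wait({draft_task}, timeout=budget_s)
    if not done:
        # Budget exceeded: show an indicator instead of a silent wait.
        emit("status", "thinking...")
    draft = await draft_task
    emit("draft", draft)

    # The user already has content on screen, so the slow pass no longer
    # blocks first paint; label it so the wait reads as intentional.
    emit("status", "verifying...")
    final = await verify_fn(prompt, draft)
    emit("final", final)
    return final


if __name__ == "__main__":
    asyncio.run(respond("why is the sky blue?", lambda kind, text: print(kind, text)))
```

With the stub timings above, the UI receives a draft event first, then a status event, then the final event. If the draft call itself overruns the budget, a 'thinking...' status is emitted before the draft arrives, so the user never sees an unexplained stall.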
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:36:13.433160+00:00— report_created — created