Report #62599
[cost_intel] When does reasoning-model latency make these models unusable in synchronous UX?
Avoid o1/o3 for any UX that requires a time-to-first-token (TTFT) under 2s. o1-mini averages 800ms-1.2s, while full o1 ranges from 3s to 10s. For chat interfaces, use GPT-4o (100-300ms) with an 'uncertainty detector' that escalates to o1 only when confidence falls below 0.7. In coding companions, latency at the 10s end of that range breaks flow state.
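The escalation rule above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Draft` type, the `answer` function, and the two model callables are hypothetical names, and the report does not say how the confidence score is computed, so it is treated here as a value the fast model's wrapper already supplies.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Threshold taken from the report: escalate when confidence < 0.7.
CONFIDENCE_THRESHOLD = 0.7


@dataclass
class Draft:
    text: str
    # Hypothetical score in [0, 1]; how it is derived is an assumption
    # left to the caller (the report does not specify a method).
    confidence: float


def answer(
    prompt: str,
    fast_model: Callable[[str], Draft],
    reasoning_model: Callable[[str], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[str, str]:
    """Return (text, route), where route is 'fast' or 'reasoning'."""
    draft = fast_model(prompt)
    if draft.confidence < threshold:
        # Low confidence: pay the reasoning-model latency for this request only.
        return reasoning_model(prompt), "reasoning"
    # Confident enough: keep the low-latency answer.
    return draft.text, "fast"
```

Passing the models in as callables keeps the routing logic independent of any particular client library, so the threshold can be tuned and tested without network calls.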
Journey Context:
The latency cliff isn't linear: o1-mini is acceptable for async tasks but unusable for live coding companions, where 1s delays break flow. A common mistake is using o1 for 'code review' in IDE plugins, which causes 5s UI freezes. The correct pattern is 'speculative execution': use 4o to stream a draft, then call o1 in the background for verification and swap in the corrected text if it finds corrections. This keeps perceived latency under 300ms while eventually delivering reasoning-level quality.
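One way the 'speculative execution' pattern might look with `asyncio` is sketched below. All names (`speculative_reply`, `stream_draft`, `verify`, `render`) are hypothetical, and the model calls are injected as callables rather than tied to a real client. The sketch assumes `verify` returns corrected text, or `None` when the draft stands.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Optional


async def speculative_reply(
    prompt: str,
    stream_draft: Callable[[str], AsyncIterator[str]],
    verify: Callable[[str, str], Awaitable[Optional[str]]],
    render: Callable[[str], None],
) -> str:
    """Stream a fast draft to the UI, then swap in a verified correction if any."""
    chunks: List[str] = []
    # Phase 1: show the fast model's text as each chunk arrives,
    # so perceived latency is the fast model's time-to-first-token.
    async for chunk in stream_draft(prompt):
        chunks.append(chunk)
        render("".join(chunks))
    draft = "".join(chunks)

    # Phase 2: the slow check is awaited, which yields to the event loop
    # instead of freezing the UI while the reasoning model works.
    correction = await verify(prompt, draft)
    if correction is not None and correction != draft:
        render(correction)  # replace the displayed draft
        return correction
    return draft
```

In an IDE plugin, the coroutine would typically be scheduled with `asyncio.create_task` so the caller is not held up by the verification phase, and `render` would update the suggestion widget in place.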
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:33:22.952357+00:00— report_created — created