Report #59193
[cost_intel] Latency cliffs in synchronous UX: when do reasoning models become unusable for live coding assistants despite higher accuracy?
For IDE autocomplete, inline chat, or any synchronous UX that requires <3 second p95 latency, use GPT-4o or smaller models; reserve reasoning models for async background tasks (test generation, bug detection) or explicit 'deep think' modes where users expect 10-30 second waits.
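The routing rule above can be sketched as a small function. This is a minimal illustration, not a definitive implementation: the tier labels, the task names, and `choose_tier` itself are hypothetical and would need to be mapped to whatever model identifiers and request types a given assistant actually uses.

```python
# Hypothetical tier labels; map them to real model identifiers in your own stack.
FAST_TIER = "fast"            # e.g. GPT-4o or a smaller model
REASONING_TIER = "reasoning"  # e.g. o1 or o1-mini

# Task kinds the report treats as async background work (names are assumptions).
BACKGROUND_KINDS = {"test_generation", "bug_detection"}


def choose_tier(kind: str, deep_think: bool = False) -> str:
    """Pick a model tier for a request.

    Reasoning models are used only when the user explicitly asked for a
    'deep think' run or when the task runs in the background. Everything
    else is treated as synchronous UX and gets the fast tier.
    """
    if deep_think or kind in BACKGROUND_KINDS:
        return REASONING_TIER
    return FAST_TIER
```

For example, `choose_tier("autocomplete")` returns the fast tier, while `choose_tier("inline_chat", deep_think=True)` and `choose_tier("test_generation")` both return the reasoning tier.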
Journey Context:
o1-mini takes 8-30 seconds for complex functions; human UX research shows abandonment rates spike 60% after 5 seconds in coding flows. Which model wins on cost per correct answer depends on the latency budget: with a 2-second budget, GPT-4o achieves 70% correctness; with a 30-second budget, o1 achieves 85%. However, the business value of the extra 15 percentage points of correctness rarely justifies the UX breakage.
Pattern: use a cheap model for synchronous generation and a reasoning model for unit test generation in the background, or adopt a 'tab to think' UX pattern where reasoning triggers only on explicit user request.
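The background variant of this pattern might look roughly like the sketch below. Both model calls are hypothetical stubs that sleep briefly instead of calling a real API, and `handle_edit` is an illustrative name; the point is only that the fast completion is returned to the editor while the slow reasoning call runs as a separate task.

```python
import asyncio


async def fast_complete(prompt: str) -> str:
    """Stand-in for a low-latency model call (hypothetical stub)."""
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"


async def reasoning_generate_tests(code: str) -> str:
    """Stand-in for a slow reasoning-model call (hypothetical stub)."""
    await asyncio.sleep(0.05)
    return f"tests for: {code}"


async def handle_edit(prompt: str):
    # Serve the synchronous path with the fast model only.
    completion = await fast_complete(prompt)
    # Start the slow call without awaiting it, so the UI is not blocked.
    tests_task = asyncio.create_task(reasoning_generate_tests(completion))
    return completion, tests_task


async def main():
    completion, tests_task = await handle_edit("def add(a, b):")
    # An editor would show `completion` immediately; the tests arrive later.
    tests = await tests_task
    return completion, tests
```

Calling `asyncio.run(main())` drives the whole flow. In a real assistant the caller would keep the task handle and surface the generated tests whenever they finish, rather than awaiting them right away as `main` does here.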
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:50:37.948208+00:00 — report_created — created