Report #66754
[cost_intel] Synchronous UX latency cliff with reasoning models in pair-programming scenarios
Hard cap: never use o3/o1 for synchronous UX that needs a response in under 2s (autocomplete, inline chat, live collaboration). Use GPT-4o or Claude-3.5-Sonnet for sync flows. Reserve reasoning models for async background tasks (code review, complex refactoring) where 10-30s of latency is acceptable. 5s is the absolute cliff: beyond it, users abandon the session.
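One way to enforce the hard cap is a small routing table keyed by task type. The sketch below is illustrative only: the task labels, the `pick_model` helper, and the choice to default unknown tasks to the fast model are assumptions, not part of this report. The model identifiers are the ones named above, written as plain strings rather than verified API model IDs.

```python
# Hypothetical routing sketch: task labels and helper names are assumptions.
SYNC_BUDGET_S = 2.0             # sync UX must respond in under 2s (from the report)
ASYNC_BUDGET_S = (10.0, 30.0)   # latency range the report calls acceptable for async work

FAST_MODEL = "gpt-4o"           # could equally be "claude-3.5-sonnet"
REASONING_MODEL = "o3"          # or "o1"

SYNC_TASKS = {"autocomplete", "inline_chat", "live_collaboration"}
ASYNC_TASKS = {"code_review", "complex_refactoring"}


def pick_model(task: str) -> str:
    """Return the model for a task; reasoning models are never used on sync paths."""
    if task in SYNC_TASKS:
        return FAST_MODEL
    if task in ASYNC_TASKS:
        return REASONING_MODEL
    # Assumption: unknown tasks are treated as latency-sensitive, so fail fast.
    return FAST_MODEL


def is_sync(task: str) -> bool:
    """True when the task sits on the interactive (<2s) path."""
    return task in SYNC_TASKS
```

Defaulting unknown tasks to the fast model keeps a mislabeled request from silently landing on the slow path; a team that prefers quality over latency for unknown tasks could flip that default.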
Journey Context:
Streaming does not fix first-token latency: o1-mini takes 3-8s before emitting its first token, even on simple prompts, because it runs an internal chain of thought first. In pair programming, cognitive flow breaks after about 2s of silence. Teams try to 'stream' reasoning models, but the delay is structural, not network-bound: it happens before any token exists to stream. Cost is secondary to the UX failure; users abandon sessions once latency exceeds 5s. The fix is architectural segregation: sync paths use a fast instruct model, async paths use a slow reasoning model.
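To check whether a given model fits the sync path, it helps to measure time to first token rather than total response time. The following is a minimal sketch under stated assumptions: `time_to_first_token`, `classify_latency`, and `fake_stream` are hypothetical helpers, and the stream is any Python iterable of text chunks rather than a specific vendor SDK object. The 2s and 5s thresholds come from this report.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple

FLOW_BUDGET_S = 2.0    # silence beyond this breaks pair-programming flow (from the report)
ABANDON_CLIFF_S = 5.0  # latency beyond this leads to session abandonment (from the report)


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, Optional[str]]:
    """Return (seconds until the first chunk arrived, that chunk or None if empty)."""
    start = time.monotonic()
    first = next(iter(stream), None)
    return time.monotonic() - start, first


def classify_latency(ttft_s: float) -> str:
    """Bucket a first-token latency against the report's two thresholds."""
    if ttft_s < FLOW_BUDGET_S:
        return "ok"
    if ttft_s <= ABANDON_CLIFF_S:
        return "flow_broken"
    return "abandon_risk"


def fake_stream(think_s: float, tokens: List[str]) -> Iterator[str]:
    """Stand-in for a model stream: stays silent for think_s, then emits tokens."""
    time.sleep(think_s)  # simulates pre-token reasoning time; streaming cannot hide it
    for tok in tokens:
        yield tok
```

Passing a real streaming response through `time_to_first_token` in place of `fake_stream` gives the number that matters for the sync/async split: a model whose measured value lands in `flow_broken` or `abandon_risk` belongs on the async path regardless of how fast it streams afterwards.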
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:31:37.501562+00:00— report_created — created