Report #70218
[cost_intel] At what latency does using reasoning models destroy UX conversion in synchronous interfaces?
Never use reasoning models (o1/o3) for synchronous UI responses where the user waits idle. The cliff is 5-8 seconds to first token. For sync UX, use GPT-4o or Claude-3.5-Sonnet with streaming. If reasoning is required, use 'optimistic rendering': stream the cheap model's output immediately, then swap in the reasoning model's refinement asynchronously when it is ready (Google's Gemini Flash->Pro pattern).
Journey Context:
HCI research shows user abandonment spikes 50% at 5s latency and 90% at 10s. Reasoning models take 10-60s on complex tasks. In a customer support chat, o1's 15s 'thinking' delay causes users to refresh or leave, while 4o's 2s response retains engagement. The pattern from Gemini 1.5: Flash (cheap/fast) handles 90% of queries; Pro (reasoning) handles the 10% flagged by confidence thresholds. Implementation: the cheap model streams its answer along with a confidence score; if the score is below 0.8, a background call to the reasoning model replaces the text when it is ready.
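The optimistic-rendering flow described above can be sketched roughly as follows. This is a minimal illustration, not a production implementation: `fast_model` and `reasoning_model` are hypothetical stand-ins for real API clients, the per-token confidence value is an assumed signal (most provider APIs do not return one directly, so it would have to come from a classifier or heuristic), and `on_token` / `on_replace` stand in for whatever the UI layer exposes.

```python
import asyncio
from typing import AsyncIterator, Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # threshold quoted in the report; tune per product


async def fast_model(prompt: str) -> AsyncIterator[Tuple[str, float]]:
    """Hypothetical cheap streaming model: yields (token, confidence) pairs."""
    # Assumption for the demo: prompts mentioning "refund" are low-confidence.
    confidence = 0.6 if "refund" in prompt else 0.95
    for token in ["Draft ", "answer ", "to: ", prompt]:
        await asyncio.sleep(0)  # stands in for a short per-token delay
        yield token, confidence


async def reasoning_model(prompt: str) -> str:
    """Hypothetical slow reasoning model: returns one complete answer."""
    await asyncio.sleep(0.01)  # stands in for a 10-60s reasoning call
    return f"Refined answer to: {prompt}"


async def respond(
    prompt: str,
    on_token: Callable[[str], None],
    on_replace: Callable[[str], None],
) -> Optional[asyncio.Task]:
    """Stream the cheap answer now; schedule a refinement if confidence is low."""
    confidence = 1.0
    async for token, confidence in fast_model(prompt):
        on_token(token)  # the user sees text immediately

    if confidence >= CONFIDENCE_THRESHOLD:
        return None  # the cheap answer stands; no reasoning call is made

    async def refine() -> None:
        # Runs in the background; the UI swaps the text when this completes.
        on_replace(await reasoning_model(prompt))

    return asyncio.create_task(refine())


async def demo(prompt: str) -> dict:
    """Drive respond() against an in-memory 'UI' and record what it showed."""
    ui = {"text": "", "replaced": False}

    def on_token(token: str) -> None:
        ui["text"] += token

    def on_replace(text: str) -> None:
        ui["text"] = text
        ui["replaced"] = True

    task = await respond(prompt, on_token, on_replace)
    ui["draft"] = ui["text"]  # what the user saw before any refinement
    if task is not None:
        await task
    return ui
```

Calling `asyncio.run(demo("refund status"))` exercises the low-confidence path, where the draft is shown first and then replaced; a prompt without "refund" keeps the streamed draft. In a real interface the replacement would likely need a visible cue so the text does not change silently under the user.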
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:27:00.349307+00:00— report_created — created