Report #76612
[cost_intel] Deploying reasoning models for real-time typing suggestions or synchronous UX
Hard ceiling of 1200 ms for typing UX; reasoning models (o3-mini) take 8-30 s for code completion. Use Haiku or GPT-4o-mini with speculative decoding instead.
Journey Context:
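The perception threshold for staying in flow while typing is roughly 100-200 ms, and the Doherty Threshold (1982) sets 400 ms as the maximum response time for sustained productivity. o3-mini generates about 800 thinking tokens before emitting about 50 output tokens, which takes around 15 s even with high TPM limits, since rate limits govern aggregate throughput rather than how fast a single request decodes. That is more than ten times the 1200 ms ceiling and destroys the UX. The alternative is speculative decoding with a small draft model (Haiku) plus a target model (Claude 3.5 Sonnet), reported here to reach about 800 ms latency with 90% quality retention.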
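The arithmetic behind the 15 s figure can be sketched as a simple latency budget. This is a minimal illustration, not a measurement: it assumes latency is dominated by sequential token decoding at a constant rate, ignores network time and time to first token, and every name in it (`Generation`, `implied_tokens_per_second`, `estimated_latency_ms`, `fits_typing_budget`) is hypothetical. The token counts, the 15 s observation, and the 1200 ms ceiling are the only values taken from the report.

```python
from dataclasses import dataclass

# Ceiling quoted in the report, in milliseconds.
HARD_CEILING_MS = 1200.0


@dataclass(frozen=True)
class Generation:
    """Token counts for one completion request (hypothetical helper type)."""
    thinking_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.thinking_tokens + self.output_tokens


def implied_tokens_per_second(gen: Generation, observed_seconds: float) -> float:
    """Decode rate implied by an observed end-to-end time."""
    return gen.total_tokens / observed_seconds


def estimated_latency_ms(gen: Generation, tokens_per_second: float,
                         overhead_ms: float = 0.0) -> float:
    """Assumed model: fixed overhead plus tokens decoded at a constant rate."""
    return overhead_ms + 1000.0 * gen.total_tokens / tokens_per_second


def fits_typing_budget(latency_ms: float,
                       ceiling_ms: float = HARD_CEILING_MS) -> bool:
    """True if the estimate stays within the typing-UX ceiling."""
    return latency_ms <= ceiling_ms


# Figures from the report: 800 thinking + 50 output tokens in about 15 s.
reasoning = Generation(thinking_tokens=800, output_tokens=50)
rate = implied_tokens_per_second(reasoning, observed_seconds=15.0)

# Same 50 output tokens with no thinking phase, at the same assumed rate.
direct = Generation(thinking_tokens=0, output_tokens=50)

reasoning_ms = estimated_latency_ms(reasoning, rate)
direct_ms = estimated_latency_ms(direct, rate)
```

Under these assumptions the implied decode rate is roughly 57 tokens/s, so the thinking phase accounts for nearly all of the 15 s. The same 50 output tokens without a thinking phase would come to roughly 880 ms, which passes `fits_typing_budget` while the reasoning request does not. Real deployments should substitute measured rates and overheads before relying on the result.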
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:11:02.473451+00:00— report_created — created