Report #66198
[cost_intel] Using reasoning models (o3/o1) in synchronous user-facing chat requiring <2s time-to-first-token
Use GPT-4o for live UX; offload hard reasoning to async batch jobs, or use a 'fast' reasoning mode (o3-mini-low) with early stopping at a 2s threshold.
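A minimal sketch of the early-stopping idea, assuming the 2s threshold is enforced client-side with a timeout. The functions `slow_reasoning_model` and `fast_model` are hypothetical stand-ins for real model calls, not any vendor's API; only the `asyncio` calls are real.

```python
import asyncio

TTFT_BUDGET_S = 2.0  # threshold taken from the report


async def slow_reasoning_model(prompt: str) -> str:
    # Hypothetical stand-in for a reasoning-model call with a long thinking phase.
    await asyncio.sleep(0.2)
    return "reasoned:" + prompt


async def fast_model(prompt: str) -> str:
    # Hypothetical stand-in for a low-latency model call.
    return "fast:" + prompt


async def answer_with_deadline(prompt, reasoning_call, fast_call,
                               budget_s=TTFT_BUDGET_S):
    """Try the reasoning call; if it exceeds the budget, cancel it and fall back."""
    try:
        # wait_for cancels the wrapped call when the timeout expires.
        text = await asyncio.wait_for(reasoning_call(prompt), timeout=budget_s)
        return ("reasoning", text)
    except asyncio.TimeoutError:
        text = await fast_call(prompt)
        return ("fast", text)
```

Note that in this version the fallback only starts once the budget has elapsed, so the fast call's own latency is added on top of the 2s. Starting both calls concurrently and keeping whichever finishes inside the budget is one alternative, at the cost of paying for both.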
Journey Context:
Reasoning models exhibit bimodal latency distributions (p50=5s, p95=45s) due to variable thinking token counts, which destroys synchronous UX. The common mistake is to assume 'smarter = better UX' while ignoring the time dimension. The cascade pattern routes the 80% of queries that are easy to fast models, keeping p95 latency on the synchronous path <1s, while preserving accuracy for the 20% of hard queries, which are handled asynchronously.
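The cascade pattern described above can be sketched roughly as follows. The difficulty heuristic (`looks_hard`, `HARD_HINTS`, the word-count cutoff) is purely an illustrative assumption; the report does not say how easy and hard queries are told apart, and a production router would more likely use a trained classifier or a cheap model call.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical keyword hints; not taken from the report.
HARD_HINTS = ("prove", "derive", "step by step", "optimize", "debug")


def looks_hard(query: str, max_easy_words: int = 60) -> bool:
    """Crude difficulty guess: long queries or ones containing a hint keyword."""
    q = query.lower()
    return len(q.split()) > max_easy_words or any(h in q for h in HARD_HINTS)


@dataclass
class CascadeRouter:
    # Hard queries wait here for an asynchronous reasoning-model worker.
    async_queue: deque = field(default_factory=deque)

    def route(self, query: str) -> str:
        """Return 'sync' for the fast-model path, 'async' for the queued path."""
        if looks_hard(query):
            self.async_queue.append(query)
            return "async"
        return "sync"
```

Usage: `CascadeRouter().route(user_message)` returns `"sync"` when the message should be answered live by the fast model, and `"async"` when it has been queued for the reasoning model and the user should be told the answer will follow.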
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:35:29.621730+00:00— report_created — created