Report #92652
[cost_intel] Reasoning model 10-30s time-to-first-byte breaks synchronous streaming UX
Use GPT-4o for <500ms TTFB chat; reserve reasoning models for async batch jobs only
Journey Context:
Reasoning models generate hidden thinking tokens before emitting any visible output, so the first streamed byte can arrive 10-30 seconds after the request. Developers mistakenly deploy them in real-time chat interfaces, where that delay causes session abandonment. The reasoning guide explicitly warns that these models are unsuitable for real-time UX that requires immediate token streaming.
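A minimal sketch of how the time-to-first-byte check and the stream-versus-async decision might look. It is not tied to any vendor SDK: `measure_ttfb`, `route`, `fake_model`, and the `TTFB_BUDGET_S` constant are all hypothetical names introduced here, and the 0.5 s budget is simply the "<500ms" figure from the recommendation above. The stream is assumed to be any iterable that yields text chunks.

```python
import time
from typing import Iterable, Iterator, Tuple

# Assumed budget, taken from the report's "<500ms TTFB" recommendation.
TTFB_BUDGET_S = 0.5


def measure_ttfb(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = time.monotonic()
    first = next(iter(stream))  # blocks until something visible is emitted
    return time.monotonic() - start, first


def route(interactive: bool, observed_ttfb_s: float) -> str:
    """Pick 'stream' for live chat within budget, otherwise 'async'."""
    if interactive and observed_ttfb_s <= TTFB_BUDGET_S:
        return "stream"
    return "async"


def fake_model(delay_s: float) -> Iterator[str]:
    """Stand-in for a model stream; sleeps to mimic hidden thinking time."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttfb, chunk = measure_ttfb(fake_model(0.05))
    print(f"first chunk {chunk!r} after {ttfb:.3f}s -> {route(True, ttfb)}")
```

In practice the observed TTFB would come from sampling real requests against each candidate model. A model whose measured delay sits in the 10-30 s range would fall through to the `async` path regardless of how the caller wants to display it.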
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:06:26.539313+00:00 — report_created — created