Report #76319
[gotcha] Streaming doesn't solve latency — Time To First Token is the real bottleneck users experience
Measure and optimize Time To First Token (TTFT) separately from streaming throughput. While waiting for the first token, show immediate deterministic UI feedback (skeleton states, status indicators). For short responses that complete in under 2 seconds, consider skipping streaming entirely: a fast batch response with a good loading state beats a stuttery stream.
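One way to apply the 2-second guideline is to decide per route from recently observed total response times. The sketch below is only illustrative: `choose_delivery_mode`, its parameters, and the choice of the median as the "typical" latency are assumptions, not part of any library or of the original report.

```python
import statistics
from typing import Sequence


def choose_delivery_mode(recent_totals_s: Sequence[float], threshold_s: float = 2.0) -> str:
    """Return "batch" when typical responses finish quickly, else "stream".

    `recent_totals_s` holds total response times (seconds) observed for one
    route or prompt type. The helper name and the use of the median are
    hypothetical choices made for illustration.
    """
    if not recent_totals_s:
        # No measurements yet: fall back to streaming (an assumed default).
        return "stream"
    typical = statistics.median(recent_totals_s)
    # Strictly under the threshold counts as "short" per the 2-second guideline.
    return "batch" if typical < threshold_s else "stream"
```

For example, `choose_delivery_mode([0.8, 1.1, 1.4])` selects the batch path, while a route whose responses typically take 3 seconds would keep streaming. A percentile such as p90 may be a safer input than the median if latencies have a long tail.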
Journey Context:
Teams often switch from request-response to streaming expecting it to eliminate latency UX issues. But streaming only helps after the first token arrives. The prefill (prompt processing) phase can take seconds, especially with long contexts or complex system prompts. During this time the user sees nothing, the same blank state as with a non-streaming request. The metric that captures this wait is TTFT. Worse, if token generation is bursty (common with reasoning models or when the server is under load), the stream can stutter, creating a worse experience than a smooth delayed response. The counter-intuitive insight: for short responses, a fast non-streaming response with a well-designed loading state can feel faster and more reliable than a streaming one with high TTFT and variable token rates. Always measure TTFT and token inter-arrival variance, not just total latency.
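A minimal sketch of how TTFT and inter-arrival variability could be measured, assuming the streaming client exposes tokens as a plain Python iterable. The function `measure_stream` and its `clock` parameter are hypothetical names introduced here; no specific SDK is implied.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and inter-arrival statistics.

    `chunks` is any iterable that yields tokens as they arrive (a stand-in
    for a real streaming client). The start timestamp is taken before the
    first token is requested, so TTFT includes prefill time when the
    iterable issues the request lazily.
    """
    start = clock()
    arrivals = []
    for _ in chunks:
        arrivals.append(clock())  # timestamp each token as it arrives

    if not arrivals:
        return {"ttft": None, "total": clock() - start,
                "gap_mean": None, "gap_stdev": None, "tokens": 0}

    # Gaps between consecutive tokens; a high spread means a stuttery stream.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] - start,
        "total": arrivals[-1] - start,
        "gap_mean": statistics.mean(gaps) if gaps else None,
        "gap_stdev": statistics.pstdev(gaps) if len(gaps) > 1 else None,
        "tokens": len(arrivals),
    }
```

Passing `clock` explicitly keeps the function testable with a fake clock. In practice, the result would be logged per request so that TTFT and gap spread can be tracked as separate metrics alongside total latency.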
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:41:48.987787+00:00— report_created — created