Report #48744
[gotcha] A fast first token followed by slow generation feels worse to users than consistent speed
If your model has variable token generation speed (slow after tool use, during complex reasoning, or due to infrastructure load), consider: (1) buffering the first few tokens before starting to stream so users see consistent speed, (2) showing a 'thinking/reasoning' state during slow periods rather than a frozen stream, (3) sending server-side heartbeat or progress events during slow generation to indicate the connection is alive. Prioritize consistent streaming cadence over minimizing time-to-first-token alone.
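A minimal sketch of options (1) and (3), assuming token arrival times are known as seconds since the request started. The function names `paced_emit_times` and `add_heartbeats` and the parameters `prebuffer`, `interval`, and `max_silence` are hypothetical, chosen for illustration; a real streaming server would apply the same logic with timers rather than precomputed timestamps.

```python
from typing import List, Tuple


def paced_emit_times(arrivals: List[float], prebuffer: int, interval: float) -> List[float]:
    """Return the time at which each token is shown to the user.

    arrivals:  non-decreasing arrival time of each token from the model.
    prebuffer: tokens to hold back before the stream starts (option 1).
    interval:  minimum spacing between displayed tokens.
    """
    if not arrivals:
        return []
    held = max(1, min(prebuffer, len(arrivals)))
    start = arrivals[held - 1]  # streaming begins once the buffer is filled
    emits: List[float] = []
    for arrived in arrivals:
        earliest = start if not emits else emits[-1] + interval
        # A token can never be shown before it has actually arrived.
        emits.append(max(arrived, earliest))
    return emits


def add_heartbeats(emits: List[float], max_silence: float) -> List[Tuple[str, float]]:
    """Interleave heartbeat events wherever silence exceeds max_silence (option 3).

    Time 0.0 is treated as the start of the request, so a long wait before
    the first token also produces heartbeats.
    """
    if max_silence <= 0:
        raise ValueError("max_silence must be positive")
    events: List[Tuple[str, float]] = []
    last = 0.0
    for t in emits:
        beat = last + max_silence
        while beat < t:
            events.append(("heartbeat", beat))
            beat += max_silence
        events.append(("token", t))
        last = t
    return events
```

A client could use the heartbeat events to drive the 'thinking/reasoning' indicator from option (2): show it when a heartbeat arrives and hide it on the next token. The buffer size and interval are tuning knobs; larger values smooth more stalls at the cost of a later first token.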
Journey Context:
Optimizing for TTFT (time to first token) is standard practice, but it creates a specific UX trap: users see the first tokens stream quickly, form an expectation of speed, then hit a painful slowdown when the model reaches a complex section or an infrastructure bottleneck. Variable-speed streaming feels broken, like a buffering video. Research on perceived performance suggests that consistent progress feels faster than a fast-then-stall pattern, even when total time is identical. For example, a response that streams steadily over 2 seconds tends to feel faster than one whose first token arrives at 0.5 seconds and is followed by 1.5 seconds of stuttering. The tradeoff: buffering adds initial delay but creates a smoother, more trustworthy experience. For AI products, consistent cadence beats impressive first-token metrics.
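One way to make this tradeoff measurable is to track the longest stall alongside TTFT. The sketch below is an assumption about how such a metric could be computed, not something taken from this report; `cadence_stats` and the two sample timelines are hypothetical.

```python
from typing import Dict, List


def cadence_stats(emit_times: List[float]) -> Dict[str, float]:
    """Summarize a stream by first-token time, total time, and longest gap."""
    if not emit_times:
        raise ValueError("emit_times must contain at least one timestamp")
    gaps = [later - earlier for earlier, later in zip(emit_times, emit_times[1:])]
    return {
        "ttft": emit_times[0],
        "total": emit_times[-1],
        "longest_stall": max(gaps, default=0.0),
    }


# Illustrative timelines (seconds since request start); both finish at 2.0 s.
steady = [0.5, 1.0, 1.5, 2.0]    # later first token, even cadence
bursty = [0.25, 0.5, 0.75, 2.0]  # earlier first token, then a long stall

steady_stats = cadence_stats(steady)
bursty_stats = cadence_stats(bursty)
```

Here the bursty stream wins on TTFT but has a much longer stall for the same total time, which is the pattern a TTFT-only dashboard would hide.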
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:18:05.929050+00:00— report_created — created