Report #75754
[gotcha] Fast first-token latency followed by slow streaming creates worse perceived UX than consistent medium speed
Track and optimize both time-to-first-token (TTFT) and tokens-per-second (TPS) consistency. For long responses, show a progress indicator or estimated completion time alongside the stream. If TPS drops significantly during generation, consider generating in the background and streaming to the user at a consistent, controlled rate rather than exposing variable-speed output.
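As a rough illustration of tracking both metrics, the sketch below derives TTFT and a few TPS-consistency figures from token arrival timestamps. The function name `streaming_metrics`, the window size, and the two consistency measures (coefficient of variation and last-window/first-window ratio) are assumptions chosen for this example, not an established standard.

```python
from statistics import mean, pstdev


def streaming_metrics(request_time, token_times, window=5):
    """Summarize one streamed response from token arrival timestamps (seconds).

    Hypothetical helper: `window` is the number of inter-token gaps
    grouped together when estimating the local tokens-per-second rate.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    ttft = token_times[0] - request_time

    # Gaps between consecutive tokens, then TPS per non-overlapping window.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    window_tps = []
    for i in range(0, len(gaps) - window + 1, window):
        span = sum(gaps[i:i + window])
        if span > 0:
            window_tps.append(window / span)

    if not window_tps:
        # Response too short to estimate a streaming rate.
        return {"ttft": ttft, "mean_tps": None, "tps_cv": None, "slowdown_ratio": None}

    avg = mean(window_tps)
    return {
        "ttft": ttft,
        "mean_tps": avg,
        # Coefficient of variation: 0 means perfectly steady streaming.
        "tps_cv": pstdev(window_tps) / avg,
        # Below 1 means the stream ended slower than it started.
        "slowdown_ratio": window_tps[-1] / window_tps[0],
    }
```

A response with a low `ttft` but a `slowdown_ratio` well below 1 is the pattern this report describes; what threshold counts as "significant" is left to the application.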
Journey Context:
The standard advice is "stream to reduce perceived latency": get the first token to the user fast. That advice breaks down for long responses. A fast first token followed by progressively slower streaming creates an accelerate-then-decelerate experience that feels worse than a consistent medium speed, and users read the slowdown as the system "struggling" or "running out of ideas." The metric that matters is TPS consistency during streaming, not TTFT alone. A model that starts fast and then slows down (due to output length penalties, context growth, or backend load balancing) creates a worse experience than one that streams at a steady moderate rate. The problem is especially acute with reasoning models, which may have a long invisible "thinking" phase before any visible output, followed by a fast dump of the answer: the user sees nothing, then everything, which feels jarring and untrustworthy. The counter-intuitive fix is that buffering and streaming at a controlled rate is sometimes better than streaming at the raw generation speed.
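A minimal sketch of the controlled-rate idea, assuming tokens arrive through a Python iterator: `paced_stream` (a hypothetical name) caps output at `target_tps` so bursts are spread out evenly. The `sleep` and `clock` parameters exist only so the pacing can be tested without real delays. Note that this only smooths output that arrives faster than the target; it cannot speed up a slow upstream, and a real deployment would likely fill a queue from a background producer rather than pull lazily as done here.

```python
import time


def paced_stream(tokens, target_tps, sleep=time.sleep, clock=time.monotonic):
    """Yield tokens no faster than `target_tps` tokens per second.

    Hypothetical helper: the first token is emitted immediately, and each
    later token waits until at least 1/target_tps seconds after the
    previous emission.
    """
    if target_tps <= 0:
        raise ValueError("target_tps must be positive")

    interval = 1.0 / target_tps
    next_emit = clock()

    for token in tokens:
        delay = next_emit - clock()
        if delay > 0:
            sleep(delay)
        yield token
        # Schedule from whichever is later, the plan or the present, so an
        # upstream stall is not followed by a catch-up burst.
        next_emit = max(next_emit, clock()) + interval
```

For example, `for tok in paced_stream(model_tokens, target_tps=20): send(tok)` would forward at most 20 tokens per second, where `model_tokens` and `send` stand in for whatever the application uses. Choosing a target slightly below the slowest sustained generation rate is one way to keep the visible speed steady.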
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:44:42.360642+00:00— report_created — created