Report #57934
[gotcha] Streaming that starts fast then stalls creates worse perceived latency than no streaming at all
Deliver streamed tokens at a consistent rate. If generation slows mid-response (complex reasoning, infrastructure throttling), keep enough tokens buffered to sustain a minimum display rate. If the stream stalls for more than 1–2 seconds, show an explicit 'still generating...' indicator. Consider adaptive streaming that smooths the delivery rate instead of passing raw token timing straight through to the UI.
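One way to picture the smoothing and the stall indicator is a small pacing sketch. It works on token arrival timestamps only, so it is independent of any particular LLM client. The function names (`paced_schedule`, `indicator_windows`, `max_gap`), the 10 tokens/second display rate, and the sample timings are all illustrative assumptions, not part of the original report.

```python
from typing import List, Tuple


def paced_schedule(arrivals: List[float], rate: float, prebuffer: int = 0) -> List[float]:
    """Return a display time for each token, given arrival times in seconds.

    Tokens are never shown faster than `rate` per second, so a fast burst
    accumulates as a buffer that drains during slower sections.
    `prebuffer` (hypothetical knob) holds the first token back until that
    many tokens have arrived.
    """
    if not arrivals:
        return []
    interval = 1.0 / rate
    if prebuffer > 0:
        start = arrivals[min(prebuffer, len(arrivals)) - 1]
    else:
        start = arrivals[0]
    shown: List[float] = []
    next_slot = start
    for arrived_at in arrivals:
        # A token cannot be displayed before it exists, nor before its slot.
        display_at = max(arrived_at, next_slot)
        shown.append(display_at)
        next_slot = display_at + interval
    return shown


def max_gap(times: List[float]) -> float:
    """Longest pause between consecutive tokens."""
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)


def indicator_windows(times: List[float], threshold: float) -> List[Tuple[float, float]]:
    """Intervals during which a 'still generating...' indicator would be visible."""
    return [(a + threshold, b) for a, b in zip(times, times[1:]) if b - a > threshold]


# Assumed sample: 20 tokens at 100 tokens/s, a 1.5 s stall, then 5 more tokens.
arrivals = [i * 0.01 for i in range(20)] + [1.69 + j * 0.01 for j in range(5)]
smoothed = paced_schedule(arrivals, rate=10.0)

print(f"raw max gap:   {max_gap(arrivals):.2f} s")
print(f"paced max gap: {max_gap(smoothed):.2f} s")
print("indicator windows (raw):  ", indicator_windows(arrivals, threshold=1.0))
print("indicator windows (paced):", indicator_windows(smoothed, threshold=1.0))
```

In this sample the raw stream pauses for about 1.5 seconds, while the paced stream never pauses for more than one display interval, because the early burst covers the stall. The trade-off is that the paced response finishes later. A real implementation would apply the same rule incrementally as tokens arrive, and the indicator remains the fallback for stalls longer than the buffer can absorb.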
Journey Context:
The promise of streaming is that users see content immediately, which improves perceived latency. The catch: if a stream starts fast (high tokens/second) and then slows dramatically, perceived latency is WORSE than if the response had loaded all at once. This happens because (1) the initial fast stream sets an expectation of speed, (2) the stall violates that expectation and causes acute frustration, and (3) users can see the partial response, so they feel the stall more viscerally than they would feel the wait for a buffered load. Web performance research indicates that inconsistent delivery rates are perceived as worse than consistently slow delivery; this is the 'progressive loading stall' pattern. LLM streaming is especially vulnerable because token generation speed varies with content complexity. The fix: smooth the streaming rate so it feels consistent, even if that means deliberately holding back some of the initially fast tokens to build a buffer for slower sections.
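How many tokens to hold back can be estimated with simple arithmetic. The sketch below assumes a deliberately simplified model: a constant display rate, a constant generation rate during the fast phase, and a stall in which no tokens arrive at all. The function names and the example numbers are hypothetical and only illustrate the sizing logic.

```python
import math


def tokens_to_cover_stall(display_rate: float, stall_seconds: float) -> int:
    """Tokens that must already be buffered to keep displaying through a full stall."""
    return math.ceil(display_rate * stall_seconds)


def seconds_to_build_buffer(gen_rate: float, display_rate: float, buffer_tokens: int) -> float:
    """Time needed to accumulate the buffer while still displaying at display_rate."""
    if gen_rate <= display_rate:
        raise ValueError("the buffer only grows when generation outpaces display")
    return buffer_tokens / (gen_rate - display_rate)


# Assumed numbers: display at 20 tokens/s, tolerate a 2 s stall, generate at 60 tokens/s.
needed = tokens_to_cover_stall(display_rate=20.0, stall_seconds=2.0)
build_time = seconds_to_build_buffer(gen_rate=60.0, display_rate=20.0, buffer_tokens=needed)
print(needed, build_time)
```

Under these assumptions, 40 buffered tokens cover the stall, and the fast phase needs to last about one second to accumulate them. Real generation rates fluctuate, so the numbers are a starting point for tuning, not a guarantee.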
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:44:00.828290+00:00— report_created — created