Report #70633
[gotcha] Optimizing for time-to-first-token masks slow total generation, creating a 'fast start, slow finish' UX that feels stuck
Track and optimize both time-to-first-token (TTFT) and total generation time. For long responses, show a progress indicator or set length expectations early. Consider streaming in logical chunks (per paragraph, per code block) rather than per token to reduce the 'slow drip' perception.
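One way to record both numbers for the same response is to wrap the token stream in a timing generator. The following is a minimal sketch, not a reference implementation: `StreamTimings`, `timed_stream`, and the injectable `clock` parameter are hypothetical names introduced here, and the wrapper assumes the token source is a plain Python iterable of strings.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTimings:
    """Timings for one streamed response (hypothetical helper type)."""
    ttft_s: Optional[float] = None   # seconds until the first token arrived
    total_s: Optional[float] = None  # seconds until the stream ended
    tokens: int = 0

    @property
    def tokens_per_s(self) -> Optional[float]:
        # Decode rate measured after the first token, so a fast TTFT
        # cannot hide a slow tail.
        if self.ttft_s is None or self.total_s is None or self.tokens < 2:
            return None
        decode_s = self.total_s - self.ttft_s
        return (self.tokens - 1) / decode_s if decode_s > 0 else None


def timed_stream(
    tokens: Iterable[str],
    timings: StreamTimings,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Yield tokens unchanged while filling in `timings`.

    The clock starts when the consumer first pulls from this generator,
    so the request should be issued lazily by `tokens` (an assumption).
    """
    start = clock()
    for tok in tokens:
        now = clock()
        if timings.ttft_s is None:
            timings.ttft_s = now - start
        timings.tokens += 1
        yield tok
    timings.total_s = clock() - start
```

Logging `ttft_s`, `total_s`, and `tokens_per_s` together makes a 'fast start, slow finish' response visible as a low TTFT paired with a high total time.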
Journey Context:
Streaming improves perceived latency by showing the first token quickly, and TTFT is the metric most teams optimize. But for long responses, the user watches text drip out slowly for 10-30+ seconds, which can feel worse than a brief loading spinner followed by a complete response. The 'fast start, slow finish' pattern is particularly bad when the AI generates code: the user sees the function signature but waits a long time for the closing brace. The counter-intuitive insight: streaming can make long responses feel slower than non-streaming. The tradeoff: the application usually cannot raise the token generation rate, but it can manage perception. The right call: for responses expected to be long, consider a brief 'thinking' state before streaming begins, show estimated length or progress, and batch tokens into logical chunks rather than rendering every single token immediately. This is the TTFT-vs-throughput tradeoff that LLM serving systems must balance.
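Batching tokens into logical chunks can be sketched as a small buffering generator. This is an illustrative sketch under stated assumptions: the output is Markdown-like text where paragraphs end at a blank line and code blocks are delimited by triple-backtick fence lines; `logical_chunks` and `FENCE` are hypothetical names, not part of any library.

```python
from typing import Iterable, Iterator, List

FENCE = "`" * 3  # Markdown code fence marker (assumed output format)


def _flush(lines: List[str]) -> str:
    return "\n".join(lines) + "\n"


def logical_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into whole paragraphs and whole code blocks.

    Blank lines between paragraphs act as separators and are dropped,
    so the renderer is expected to space the chunks itself.
    """
    pending = ""            # text received but not yet a complete line
    chunk: List[str] = []   # complete lines of the chunk being built
    in_code = False
    for tok in tokens:
        pending += tok
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            is_fence = line.strip().startswith(FENCE)
            if is_fence and not in_code and chunk:
                # A code block starts: emit the prose collected before it.
                yield _flush(chunk)
                chunk = []
            if is_fence:
                in_code = not in_code
            if line.strip() == "" and not in_code:
                # Blank line outside code: the paragraph is complete.
                if chunk:
                    yield _flush(chunk)
                    chunk = []
                continue
            chunk.append(line)
            if is_fence and not in_code:
                # Closing fence: the code block is complete.
                yield _flush(chunk)
                chunk = []
    # Emit whatever is left when the stream ends, even if unterminated.
    tail = "\n".join(chunk + ([pending] if pending else []))
    if tail:
        yield tail
```

The cost of this design is that nothing appears until a paragraph or code block completes, so it pairs naturally with a 'thinking' state or progress indicator while the first chunk is buffered.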
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:08:17.225786+00:00— report_created — created