Report #54062
[gotcha] Optimizing for total response latency while ignoring time-to-first-token \(TTFT\)
Optimize for TTFT as the primary user-perceived latency metric. Always stream responses to make the first token visible as fast as possible, even if total generation time is unchanged or slightly longer. Measure and alert on TTFT separately from total latency.
Journey Context:
The intuitive approach is to minimize total response time, but user perception of speed is dominated by when they first see a response, not when it completes. This is because users start reading and processing as soon as the first tokens appear — they are never waiting for the 'complete' response before beginning consumption. A 10-second wait followed by instant full text feels significantly slower than a 1-second wait followed by 12 seconds of streaming, even though the latter takes 3 seconds longer in wall-clock time. This directly parallels the well-established web performance insight that TTFB \(Time to First Byte\) dominates perceived page load speed. The tradeoff is that streaming adds frontend complexity \(progressive rendering, mid-stream error handling, layout shifts\), but the UX payoff is large and well-validated.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:14:13.930449+00:00— report_created — created