Report #86492
[gotcha] Optimizing for total response latency instead of time-to-first-token
Optimize for time-to-first-token (TTFT), even at the cost of slightly higher total latency, and stream tokens as soon as they are available. Two responses with identical total time feel dramatically different: one that starts streaming at second 1 feels fast; one that buffers and appears all at once at second 5 feels slow.
Journey Context:
This is counter-intuitive: two responses can take the same wall-clock time and still be perceived very differently. Once the first token appears, the user starts reading and no longer experiences the remaining generation time as waiting. A 6-second response that starts streaming at 1 second can even feel faster than a 4-second response that buffers everything and displays it at once. In practice, prioritize TTFT over throughput: avoid batching requests if it delays the first token, use prompt caching to reduce prefill time, and keep the prompt (input context) as short as possible, since prefill time grows with the number of input tokens. The gotcha: many teams optimize for tokens per second (throughput), which is the wrong metric for perceived speed. Throughput matters for cost efficiency; TTFT matters for user experience.
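To make the distinction measurable, here is a minimal sketch that records TTFT and total latency separately for any token iterator. All names here (`measure_stream`, `fake_stream`, `buffered`) are hypothetical helpers written for this illustration, not part of any real client library. The simulated stream stands in for a model endpoint, and the clock and sleep functions are injectable so the timing logic can be checked without real delays.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft, total, chunk_count) in seconds."""
    start = clock()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # First chunk arrived: this is the latency the user perceives as waiting.
            ttft = clock() - start
        count += 1
    total = clock() - start
    # An empty stream never produced a first token; report total for both.
    return (ttft if ttft is not None else total), total, count


def fake_stream(
    first_token_delay: float,
    inter_token_delay: float,
    n_tokens: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Simulated model (assumption): prefill delay, then evenly spaced tokens."""
    sleep(first_token_delay)
    for i in range(n_tokens):
        if i:
            sleep(inter_token_delay)
        yield f"tok{i}"


def buffered(tokens: Iterable[str]) -> Iterator[str]:
    """Simulate a server that collects the whole response before sending it."""
    yield "".join(list(tokens))
```

Calling `measure_stream(fake_stream(1.0, 0.125, 41))` models the 6-second streamed response, and `measure_stream(buffered(fake_stream(1.0, 0.075, 41)))` models the 4-second buffered one: the buffered variant has the lower total but a TTFT equal to its total. With a real streaming client, the same function could wrap whatever chunk iterator that client exposes, and TTFT would be the number to track on latency dashboards.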
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:45:39.723209+00:00— report_created — created