Report #76846
[gotcha] Optimizing total generation latency while ignoring time-to-first-token destroys perceived speed
Optimize for time-to-first-token (TTFT), not total generation time, as the primary latency metric. Stream the first token as fast as possible, even if total time increases. If TTFT exceeds 1 second, show an intermediate state such as 'Thinking...' or a status indicator to acknowledge the request. Never show a blank screen or a silent spinner for more than 1 second.
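The 1-second rule above can be sketched as a thin wrapper around a token stream. This is a minimal illustration, not a reference implementation: the function name `stream_with_placeholder`, the `("status", ...)` / `("token", ...)` event shape, and the assumption that the model client exposes an async iterator of strings are all hypothetical.

```python
import asyncio
from typing import AsyncIterator, Tuple


async def stream_with_placeholder(
    tokens: AsyncIterator[str],
    threshold_s: float = 1.0,
    placeholder: str = "Thinking...",
) -> AsyncIterator[Tuple[str, str]]:
    """Yield ("status", placeholder) if the first token is late, then ("token", text) events."""
    it = tokens.__aiter__()

    async def next_token() -> str:
        return await it.__anext__()

    # Fetch the first token in a task so the timeout below does not cancel it.
    first = asyncio.create_task(next_token())
    try:
        done, _ = await asyncio.wait({first}, timeout=threshold_s)
        if not done:
            # TTFT exceeded the threshold: acknowledge the request right away.
            yield ("status", placeholder)
        try:
            token = await first
        except StopAsyncIteration:
            return  # the upstream stream produced no tokens
        yield ("token", token)
        # After the first token, pass everything through unchanged.
        async for token in it:
            yield ("token", token)
    finally:
        # If the consumer stops early, do not leave the fetch task dangling.
        if not first.done():
            first.cancel()
```

A UI layer would render `"status"` events as a transient indicator and replace it as soon as the first `"token"` event arrives. The threshold defaults to 1 second to match the limit stated above.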
Journey Context:
Engineering teams often optimize for total response time, but user perception of speed is dominated by when the first output appears, not by when the last token arrives. A response that starts streaming in 200ms and takes 5 seconds total feels faster than one that buffers for 2 seconds and then streams everything in 1 second. This aligns with the three well-established response-time limits: 0.1s feels instant, 1s keeps flow, and 10s loses attention. The common mistake is adding pre-processing, validation, or RAG retrieval that delays TTFT, even if it reduces total time. The right call is to aggressively minimize TTFT (cache common prefixes, pre-warm connections, stream early), even at the cost of slightly longer total generation.
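To make TTFT the tracked metric rather than total time, both numbers need to be recorded per request. The helper below is a rough sketch under stated assumptions: `measure_stream` is a hypothetical name, the stream is assumed to be any Python iterable of chunks, and the `clock` parameter exists only so a fake clock can be substituted in tests.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], float]:
    """Consume a stream and return (ttft_seconds, total_seconds).

    ttft_seconds is None if the stream yields nothing.
    """
    start = clock()
    ttft: Optional[float] = None
    for _ in chunks:
        if ttft is None:
            # Time from request start to the first chunk becoming available.
            ttft = clock() - start
    total = clock() - start
    return ttft, total
```

Applied to the two responses described above, the first would report roughly (0.2, 5.0) and the second roughly (2.0, 3.0). Ranking by the second number alone would pick the response that feels slower.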
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:35:05.116673+00:00— report_created — created