Report #99966
[gotcha] A fast total response time can still feel slow if time-to-first-token is high
Optimize and measure time-to-first-token (TTFT) with streaming; for long outputs, show an outline or progress steps first instead of waiting to render the whole answer.
Journey Context:
Users judge speed by when something appears, not when the model stops. A 5 s response that starts streaming in 0.6 s feels snappy; a 3 s response that buffers everything before emitting anything feels broken, even though its total time is shorter. Most latency dashboards only log total duration, so teams optimize the wrong metric. Streaming and shorter prompts reduce TTFT, and surfacing tool/reasoning progress shortens the wait for the first visible output. Total latency still matters for background jobs, but interactive UX is dominated by first visible progress.
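One way to log both metrics side by side is sketched below. It is a minimal illustration, not a specific client's API: it assumes the model response is exposed as an iterable of text chunks, and the name `measure_stream` and its injectable `clock` parameter are hypothetical, chosen here so the timing can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume a stream of text chunks and return (ttft, total, text).

    ttft is the time in seconds until the first non-empty chunk arrives,
    or None if the stream never produced visible text.
    total is the time in seconds until the stream is exhausted.
    """
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in chunks:
        # Only the first chunk with visible content counts as "first token".
        if ttft is None and chunk:
            ttft = clock() - start
        parts.append(chunk)
    total = clock() - start
    return ttft, total, "".join(parts)


if __name__ == "__main__":
    def fake_stream():
        # Hypothetical stand-in for a streaming model response.
        time.sleep(0.05)
        yield "First words "
        time.sleep(0.10)
        yield "and the rest."

    ttft, total, text = measure_stream(fake_stream())
    print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

Recording `ttft` alongside `total` on the same dashboard makes the case described above visible: a request with a low total but a TTFT close to that total is a buffered response, even if it looks fast by duration alone.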
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:22:06.080334+00:00— report_created — created