Report #78003
[gotcha] Optimizing for total response time instead of time-to-first-token (TTFT): users perceive slow-start responses as broken
Make TTFT your primary latency metric. Implement: (a) streaming, so the first token reaches the user as soon as it is generated; (b) prompt caching (offered by both OpenAI and Anthropic) to skip reprocessing a repeated prompt prefix such as a long system prompt; (c) model routing: send simple queries to a fast, small model to minimize TTFT. Accept a longer total generation time if TTFT stays under ~500 ms.
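To treat TTFT as a metric you first have to record it separately from total latency. The following is a minimal sketch, assuming the response arrives as an iterable of text chunks; `measure_stream` and `fake_stream` are hypothetical names used for illustration, not part of any provider SDK.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a stream of text chunks and return (ttft, total, text).

    Assumption: `chunks` is lazy, so the request work happens while we
    iterate. If your client sends the request eagerly, start the clock
    before making the call instead.
    """
    start = clock()
    ttft: Optional[float] = None
    parts: List[str] = []
    for chunk in chunks:
        # TTFT is the delay until the first non-empty chunk arrives.
        if ttft is None and chunk:
            ttft = clock() - start
        parts.append(chunk)
    total = clock() - start
    if ttft is None:
        # Nothing was streamed; report the full wait as TTFT.
        ttft = total
    return ttft, total, "".join(parts)


def fake_stream(first_delay: float, per_token_delay: float,
                tokens: List[str]) -> Iterator[str]:
    """Stand-in for a model stream: waits, then yields tokens one by one."""
    time.sleep(first_delay)
    for i, token in enumerate(tokens):
        if i:
            time.sleep(per_token_delay)
        yield token


if __name__ == "__main__":
    ttft, total, text = measure_stream(
        fake_stream(0.05, 0.01, ["Hello", ", ", "world"])
    )
    print(f"TTFT={ttft * 1000:.0f} ms  total={total * 1000:.0f} ms  text={text!r}")
```

Logging both numbers per request makes it possible to alert on TTFT against the ~500 ms target while tracking total time as a secondary metric.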
Journey Context:
Teams often optimize for total end-to-end latency (the time from request to complete response), but user perception of speed is dominated by TTFT, the time until the first token appears. A response that starts streaming in 300 ms and takes 10 s in total feels faster than one that takes 2 s to start and 5 s in total. This is counter-intuitive: the second option finishes sooner, but its blank-screen wait triggers "is it broken?" anxiety. Streaming is the primary lever, and prompt caching is the hidden multiplier: it can cut TTFT by as much as ~80% when a long system prompt is repeated across requests, with the gain scaling with the length of the cached prefix. Model routing (a fast model for easy queries, a capable model for hard ones) further reduces TTFT for the common case.
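The routing idea can be sketched with a simple heuristic. This is an illustrative assumption only: the model names, the hint list, and the word-count threshold are all hypothetical placeholders, and a production router would more likely use a classifier or observed quality data.

```python
# Hypothetical model identifiers; substitute whatever your provider offers.
FAST_MODEL = "fast-small-model"
CAPABLE_MODEL = "capable-large-model"

# Hypothetical phrases that suggest a query needs deeper reasoning.
HARD_HINTS = ("step by step", "prove", "refactor", "analyze", "compare")


def choose_model(query: str, max_easy_words: int = 40) -> str:
    """Route short, simple queries to the fast model to keep TTFT low."""
    text = query.lower()
    # Long queries are treated as hard by default.
    if len(text.split()) > max_easy_words:
        return CAPABLE_MODEL
    # Any reasoning hint sends the query to the capable model.
    if any(hint in text for hint in HARD_HINTS):
        return CAPABLE_MODEL
    return FAST_MODEL


if __name__ == "__main__":
    for q in ("What is our refund policy?",
              "Compare these two designs step by step."):
        print(f"{q!r} -> {choose_model(q)}")
```

Defaulting to the fast model keeps the common case quick; the cost is that a misclassified hard query gets a weaker answer, so the thresholds are worth tuning against real traffic.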
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:31:46.229135+00:00— report_created — created