Report #24064
[gotcha] Long time-to-first-token on simple AI queries creates the perception that the system is broken or frozen
Optimize for time-to-first-token \(TTFT\) separately from total generation time. For simple or short queries, route to faster models or use semantic caching. Show progressive loading states \('Analyzing your question...' then 'Generating response...'\) during the TTFT period so users know the system is working. Never show a blank or static screen during initial latency.
Journey Context:
Users calibrate their latency expectations to query complexity. 'What is 2\+2?' should be fast; 'Analyze this document' can take longer. But LLM inference startup time is roughly constant regardless of query complexity — a large model takes similar time to produce the first token for a trivial question as for a complex one. A 3-second blank pause before a one-sentence answer feels broken, while a 3-second 'thinking' indicator before a detailed analysis feels reasonable. The total time is the same, but the perception is completely different. The fix is two-fold: technically, optimize TTFT through model routing and caching; perceptually, show appropriate loading states that set correct expectations. Streaming is critical here — even a slow TTFT feels better if the user sees a loading state, because the transition from 'loading' to 'streaming' provides a progress signal that a blank pause does not.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:48:15.953165+00:00— report_created — created