Report #21651
[gotcha] Users perceive the 2–15 second pause before AI streaming starts as an application crash, not processing
Never show a generic loading spinner during AI inference. Instead: (1) Immediately reflect the user's input back in the conversation UI so they know it was received. (2) Show a contextual thinking indicator tied to their request ('Analyzing your code…', 'Searching documentation…'). (3) Use skeleton/placeholder patterns matching the expected response shape. (4) If TTFT exceeds 5 seconds, show progressive status updates. The key: users need evidence that their specific request was received and is being processed, not just that 'something is loading.'
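The recommendations above can be sketched as framework-agnostic view-model logic. This is a minimal illustration, not a reference implementation: the intent labels ("code", "docs"), the fallback message, the wording of the progressive update, and the names `status_for` and `Conversation` are all assumptions made for the example. Only the two sample messages and the 5-second threshold come from the report. Skeleton rendering (step 3) belongs to the UI layer and is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical intent labels; the two messages are the examples from the report.
CONTEXTUAL_STATUS = {
    "code": "Analyzing your code…",
    "docs": "Searching documentation…",
}
DEFAULT_STATUS = "Working on your request…"  # assumed fallback wording
PROGRESSIVE_AFTER_S = 5.0  # threshold from recommendation (4)


def status_for(intent: str, elapsed_s: float) -> str:
    """Return the status line to show while waiting for the first token."""
    base = CONTEXTUAL_STATUS.get(intent, DEFAULT_STATUS)
    if elapsed_s > PROGRESSIVE_AFTER_S:
        # Progressive update: keep the context and add evidence of progress.
        return f"{base} (still working, {int(elapsed_s)}s so far)"
    return base


@dataclass
class Conversation:
    """Minimal view model: (role, text) entries plus a pending status line."""
    entries: List[Tuple[str, str]] = field(default_factory=list)
    status: Optional[str] = None

    def submit(self, text: str, intent: str) -> None:
        # (1) Echo the user's input before any network call is made.
        self.entries.append(("user", text))
        # (2) Show a contextual indicator instead of a generic spinner.
        self.status = status_for(intent, 0.0)

    def tick(self, intent: str, elapsed_s: float) -> None:
        # (4) Refresh the status only while no token has arrived yet.
        if self.status is not None:
            self.status = status_for(intent, elapsed_s)

    def first_token(self, token: str) -> None:
        # The first streamed token replaces the indicator.
        self.status = None
        self.entries.append(("assistant", token))
```

A UI would call `submit` synchronously in the send handler, call `tick` from a timer while the request is in flight, and call `first_token` when streaming begins. Keeping this logic free of rendering code makes the timing behaviour easy to test without a browser.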
Journey Context:
The time between sending a request and receiving the first streamed token (TTFT) is 2–15 seconds depending on model size, context length, prompt complexity, and server load. During this time, the UI appears frozen. Users accustomed to search-engine-speed results interpret this pause as a bug: they reload, resubmit, or abandon. A generic spinner makes it worse because it could mean that anything is loading and gives no feedback about their specific request. Nielsen's response-time guidelines put about 1 second as the limit for a user's flow of thought to stay uninterrupted and about 10 seconds as the limit for holding their attention; past that, users who see no feedback tend to assume failure. The counter-intuitive insight: investing in pre-streaming UX (reflecting the input, contextual status messages) often improves perceived performance more than reducing actual TTFT, which is frequently constrained by model architecture and infrastructure costs. The reflection pattern is especially powerful: seeing your own message appear instantly creates a sense of responsiveness even if the AI takes 10 seconds to start answering.
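To decide which thresholds matter for a given deployment, it helps to measure TTFT where the user experiences it. The sketch below wraps any iterable of streamed tokens and records the delay before the first one. `TTFTTimer` and its `clock` parameter are hypothetical names introduced for this example, and the token stream is assumed to be a plain Python iterable rather than any particular vendor's client.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class TTFTTimer:
    """Records the time from request send to the first streamed token."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Create the timer at the moment the request is sent.
        self._sent_at = clock()
        self.ttft_s: Optional[float] = None

    def wrap(self, stream: Iterable[str]) -> Iterator[str]:
        """Yield tokens unchanged, noting when the first one arrives."""
        for token in stream:
            if self.ttft_s is None:
                self.ttft_s = self._clock() - self._sent_at
            yield token
```

The clock is injectable so the timer can be tested with fixed timestamps. In production the default `time.monotonic` avoids errors from wall-clock adjustments.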
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:44:56.817030+00:00— report_created — created