Report #66153
[gotcha] First token latency gap creates perceived app failure even when total response time is acceptable
Replace generic loading spinners with progressive status indicators that communicate what the system is doing: 'Searching knowledge base...', 'Analyzing your document...', 'Generating response...'. Show the user's input reflected back immediately. For long TTFT scenarios, show intermediate tool-use status or retrieval steps.
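A minimal sketch of the status-event idea above, assuming the backend can push typed events to the client before the first model token exists. `Event`, `STAGE_LABELS`, `respond`, and the fake stage and token helpers are hypothetical names for illustration, not part of any particular framework; the label strings are taken from this report.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    kind: str  # "echo", "status", or "token"
    text: str


# Hypothetical mapping from pipeline stage to user-facing status text.
STAGE_LABELS = {
    "retrieve": "Searching knowledge base...",
    "analyze": "Analyzing your document...",
    "generate": "Generating response...",
}

Stage = Tuple[str, Callable[[], Awaitable[None]]]


async def respond(
    user_input: str, stages: List[Stage], tokens: AsyncIterator[str]
) -> AsyncIterator[Event]:
    # Reflect the user's input back immediately so they know it was received.
    yield Event("echo", user_input)
    # Announce each pre-generation stage before doing its (slow) work.
    for name, work in stages:
        yield Event("status", STAGE_LABELS.get(name, "Working..."))
        await work()
    # Announce generation, then forward model tokens as they arrive.
    yield Event("status", STAGE_LABELS["generate"])
    async for token in tokens:
        yield Event("token", token)


async def _fake_retrieval() -> None:
    # Stand-in for a RAG lookup or tool call (assumption for the demo).
    await asyncio.sleep(0.01)


async def _fake_tokens() -> AsyncIterator[str]:
    # Stand-in for a streaming model response (assumption for the demo).
    for token in ["Hello", ", ", "world"]:
        yield token


async def demo() -> List[Event]:
    stream = respond("What is TTFT?", [("retrieve", _fake_retrieval)], _fake_tokens())
    return [event async for event in stream]
```

A client would render `echo` events as the user's chat bubble, replace the spinner text on each `status` event, and append `token` events to the answer. How the events reach the browser (server-sent events, WebSocket, or similar) is left open here.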
Journey Context:
The most critical UX metric for streaming AI is not total response time but Time to First Token (TTFT). Users tolerate a 15-second total response if text starts appearing within 1-2 seconds. But if TTFT is 8+ seconds (common with large context windows, RAG retrieval, or complex tool-call chains), users perceive the app as broken, even when total time matches that of a faster-TTFT response. A generic spinner during this dead zone is the worst choice: it provides no feedback and triggers 'is it frozen?' anxiety, leading to rage-clicks and page refreshes. The fix is operational transparency: show what the system is doing during the wait. If RAG is retrieving documents, show 'Searching knowledge base...'. If tools are being called, show 'Looking up current data...'. This transforms dead wait time into perceived productive time. Some teams also show the user's message in the chat immediately (with a subtle 'sending' indicator) so users know their input was received. The key insight: users do not mind waiting if they understand why they are waiting. Silence and spinners are the enemy, not latency itself. The trap is that backend teams optimize for total latency while the user experience is dominated by the TTFT dead zone.
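Because the trap described above is optimizing total latency only, it can help to record TTFT as its own metric. The following is a rough sketch, assuming the response is available as an iterable of tokens. `LatencyReport`, `measure_stream`, and `TTFT_BUDGET_SECONDS` are hypothetical names, and the 2-second budget is an assumption derived from the "1-2 seconds" figure in this report, not a measured threshold.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumed budget, based on the report's "1-2 seconds" tolerance.
TTFT_BUDGET_SECONDS = 2.0


@dataclass
class LatencyReport:
    ttft_seconds: Optional[float]  # None if the stream produced no tokens
    total_seconds: float

    @property
    def needs_status_ui(self) -> bool:
        # True when the pre-token wait is long enough to warrant status messages.
        return self.ttft_seconds is None or self.ttft_seconds > TTFT_BUDGET_SECONDS


def measure_stream(
    tokens: Iterable[str], clock: Callable[[], float] = time.monotonic
) -> LatencyReport:
    start = clock()
    ttft: Optional[float] = None
    for _ in tokens:
        if ttft is None:
            # The first token arrived: record the dead-zone duration once.
            ttft = clock() - start
    return LatencyReport(ttft_seconds=ttft, total_seconds=clock() - start)
```

The `clock` parameter exists only so the measurement can be tested with a fake clock; in production the default `time.monotonic` would be used. Two responses with the same `total_seconds` can differ sharply in `ttft_seconds`, which is the gap this report describes.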
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:30:49.399920+00:00— report_created — created