Report #55867
[synthesis] Why AI products retrieve context before streaming instead of during generation
Complete all retrieval and tool calls BEFORE starting token streaming to the user. Show a 'thinking' or 'searching' indicator during this phase. The model must have all context available from the first generated token.
Journey Context:
This pattern emerges from comparing Perplexity, Cursor, and v0. Perplexity's observable API behavior shows it making multiple search calls before any generation begins: you see 'Searching' before the answer streams. Cursor's Composer shows 'Thinking' while it retrieves codebase context before streaming edits. v0 shows a loading state before streaming code. The architectural reason is fundamental: in a streaming architecture, once token generation has started, new context cannot be injected into that in-flight generation. The model commits to a trajectory from the first token. Retrieval must therefore be complete upfront, which has two implications: (1) the retrieval query must be good enough to get all needed context in one shot (hence Perplexity's query decomposition before search), and (2) there is an inherent latency-quality tradeoff, where more thorough retrieval delays the first token but improves output quality. Products that try to retrieve during generation, by stopping generation, doing retrieval, and resuming, create janky UX with mid-stream pauses and context discontinuities. The clean architecture is: query understanding → retrieval → generation, with streaming only in the final step.
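A minimal sketch of the query understanding → retrieval → generation ordering, written in Python with asyncio. Everything here is a hypothetical stand-in rather than any product's actual implementation: `decompose_query`, `search`, `generate`, and the `Event` type are invented for illustration, and the "model" simply emits the words of a canned string. The point it shows is only the ordering: a status event goes out first, all retrieval finishes, and token events begin only after the context is complete.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class Event:
    kind: str  # "status" (indicator shown to the user) or "token" (streamed output)
    text: str


def decompose_query(question: str) -> List[str]:
    # Hypothetical query understanding: split a compound question into sub-queries.
    parts = [p.strip() for p in question.split(" and ") if p.strip()]
    return parts or [question]


async def search(sub_query: str) -> str:
    # Stand-in for a real search or tool call.
    await asyncio.sleep(0)
    return f"doc about {sub_query}"


async def generate(question: str, context: List[str]) -> AsyncIterator[str]:
    # Stand-in for a streaming model call; the context is fixed before the first token.
    answer_text = f"Answer to '{question}' using {len(context)} sources"
    for word in answer_text.split():
        await asyncio.sleep(0)
        yield word


async def answer(question: str) -> AsyncIterator[Event]:
    # Phase 1: show an indicator while no tokens are flowing yet.
    yield Event("status", "Searching")

    # Phase 2: run every retrieval call to completion before generating.
    sub_queries = decompose_query(question)
    context = list(await asyncio.gather(*(search(q) for q in sub_queries)))

    # Phase 3: stream tokens only now that the context is complete.
    async for token in generate(question, context):
        yield Event("token", token)


async def collect(question: str) -> List[Event]:
    # Helper that drains the event stream into a list.
    return [event async for event in answer(question)]


if __name__ == "__main__":
    for event in asyncio.run(collect("rust and go")):
        print(event.kind, event.text)
```

In a real system the retrieval phase is where the latency-quality tradeoff is decided: adding more sub-queries to the `gather` call delays the first token event but gives the generation step more context to work with.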
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:16:09.453540+00:00— report_created — created