Report #42441
[synthesis] Retrieve-then-generate vs interleaved retrieval for AI RAG products
Use an interleaved approach where the model can trigger retrieval mid-generation based on what it has already produced and what it still needs. Do not fetch all context upfront in a single retrieval pass.
Journey Context:
Naive RAG retrieves, then generates: fetch documents, stuff them into context, then generate. But cross-referencing Perplexity's observable API behavior (per-sentence citations with varying latency, Pro Search's visible multi-step reasoning) with Aravind Srinivas's public statements about their architecture, and with Cursor Composer's on-demand file reading pattern, points to a different architecture. These products appear to interleave retrieval and generation: the model starts generating, reaches a point where it needs information, triggers a targeted retrieval, incorporates the result, and continues. This would explain why Perplexity can cite per sentence rather than per document: the retrieval was triggered for that specific claim. The tradeoff is added latency and implementation complexity (you need tool-use/function-calling infrastructure), but the quality improvement is decisive. Retrieve-then-generate tends to produce generic, context-diluted outputs because the fetch is driven by a pre-generation query that cannot anticipate what the model will actually need. Interleaved retrieval produces more precise, better-sourced outputs because each fetch targets the model's information need at that point in the generation.
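The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Perplexity or Cursor implement it: `scripted_model` is a hypothetical stand-in for an LLM with tool calling, `retrieve` is a toy keyword-overlap search over an in-memory `CORPUS`, and all names here are invented for the example. What it shows is the control flow: the model either emits a sentence or asks for a fetch, and each fetch result is attached to the sentence that needed it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class Text:
    content: str  # a sentence the model adds to the answer


@dataclass
class RetrieveRequest:
    query: str  # the model asks for evidence before continuing


@dataclass
class Evidence:
    doc_id: str
    text: str


Event = Union[Text, RetrieveRequest, Evidence]

# Hypothetical toy corpus; a real system would query a search index.
CORPUS = {
    "doc-a": "Interleaved retrieval issues a targeted fetch per claim.",
    "doc-b": "Naive RAG fetches all context once before generating.",
}


def retrieve(query: str, k: int = 1) -> List[Evidence]:
    """Rank documents by word overlap with the query (toy retriever)."""
    terms = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(terms & set(text.lower().rstrip(".").split()))

    ranked = sorted(CORPUS.items(), key=lambda kv: overlap(kv[1]), reverse=True)
    return [Evidence(doc_id, text) for doc_id, text in ranked[:k]]


# Assumption: the information needs are scripted here. A real model would
# decide them from what it has generated so far.
PLAN = ["fetches all context once", "targeted fetch per claim"]


def scripted_model(transcript: List[Event]) -> Optional[Event]:
    """Stand-in for an LLM: returns the next step, or None when finished."""
    sentences = sum(isinstance(e, Text) for e in transcript)
    if sentences == len(PLAN):
        return None
    last = transcript[-1] if transcript else None
    if isinstance(last, Evidence):
        # Evidence just arrived: write the claim and cite its source.
        return Text(f"{last.text} [{last.doc_id}]")
    return RetrieveRequest(PLAN[sentences])


def generate_interleaved(
    model: Callable[[List[Event]], Optional[Event]],
    retriever: Callable[[str], List[Evidence]],
    max_steps: int = 8,
) -> List[str]:
    """Alternate between generation steps and targeted retrievals."""
    transcript: List[Event] = []
    answer: List[str] = []
    for _ in range(max_steps):  # hard cap so a misbehaving model cannot loop forever
        step = model(transcript)
        if step is None:
            break
        if isinstance(step, RetrieveRequest):
            transcript.extend(retriever(step.query))
        else:
            transcript.append(step)
            answer.append(step.content)
    return answer


answer = generate_interleaved(scripted_model, retrieve)
```

Because each retrieval is tied to one sentence, per-sentence citation falls out of the loop structure. The `max_steps` cap is one simple way to bound the extra latency this pattern introduces.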
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:42:29.738090+00:00 - report_created - created