Report #79847
[synthesis] RAG pipeline retrieves all documents first and only then generates, compounding latency and preventing mid-generation correction
Interleave retrieval and generation. Decompose the query into sub-queries and run the searches in parallel, resolve citations while tokens are being generated, and stream results with citation anchors embedded mid-generation rather than appended afterward. Start generation from a fast initial retrieval pass and refine with deeper retrieval concurrently.
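A minimal sketch of the interleaving described above, using Python's asyncio. Everything here is a stand-in: `search`, `decompose`, `interleaved_answer`, and `collect` are hypothetical names, the sleeps simulate network and decode latency, and the canned word list replaces a real model. It is not any vendor's API, only an illustration of starting deep searches in the background, generating from a fast pass, and emitting citation anchors as sources resolve.

```python
import asyncio

async def search(query: str, delay: float) -> str:
    """Hypothetical search call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"doc:{query}"

def decompose(question: str) -> list[str]:
    """Naive query decomposition on ' and ' (for illustration only)."""
    return [part.strip() for part in question.split(" and ")]

async def interleaved_answer(question: str):
    """Yield ("token", text) and ("cite", source) events as each becomes available."""
    # Start the deeper sub-query searches in parallel without waiting for them.
    pending = {asyncio.create_task(search(q, 0.03)) for q in decompose(question)}
    # The fast initial pass is the only retrieval that blocks the first token.
    ready = [await search(question, 0.01)]
    for word in ["Interleaving", "overlaps", "retrieval", "with", "generation."]:
        await asyncio.sleep(0.01)  # stand-in for per-token decode time
        yield ("token", word)
        # Collect deep searches that finished while this token was decoding.
        done = {task for task in pending if task.done()}
        pending -= done
        ready.extend(task.result() for task in done)
        # Embed citation anchors for every source resolved so far.
        for source in ready:
            yield ("cite", source)
        ready.clear()
    # Flush citations for any searches that outlasted generation.
    if pending:
        for source in await asyncio.gather(*pending):
            yield ("cite", source)

async def collect(question: str) -> list[tuple[str, str]]:
    """Drain the stream into a list (convenience for trying the sketch)."""
    return [event async for event in interleaved_answer(question)]
```

Calling `asyncio.run(collect("latency and citations"))` returns a list whose first event is a token, followed by citation events mixed in among the remaining tokens. The exact position of each deep-search citation depends on timing, which is the point: anchors land wherever the source happens to resolve rather than in a block at the end.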
Journey Context:
Perplexity's API returns citations mid-stream with specific token offsets rather than appending them at the end. Its observable latency shows citation resolution happening concurrently with generation: the first tokens appear before all search results have returned. Perplexity's job postings for retrieval infrastructure engineers and its engineering blog on search architecture point to parallel retrieval with query decomposition. The sequential retrieve-then-generate pattern creates two problems: the retrieval and generation latencies add, so the user waits for all of retrieval before seeing a single token, and the model cannot adjust its output based on what retrieval finds mid-stream. With the interleaved approach, the model can self-correct if retrieval returns unexpected results. This is why Perplexity feels faster than naive RAG despite doing more work: it parallelizes the slow parts.
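To make the latency argument concrete, here is a toy model in Python. The function names and all millisecond figures are assumptions chosen for illustration, not measurements of any real system, and the model ignores the decode time of the first token itself.

```python
def sequential_ms(search_ms: list[int], generate_ms: int) -> tuple[int, int]:
    """Return (time to first token, total time) when searches run one after
    another and generation starts only once all of them have finished."""
    retrieval = sum(search_ms)
    return retrieval, retrieval + generate_ms

def interleaved_ms(fast_ms: int, search_ms: list[int], generate_ms: int) -> tuple[int, int]:
    """Return (time to first token, total time) when generation starts after a
    fast pass and the deep searches run in parallel alongside it."""
    total = max(fast_ms + generate_ms, max(search_ms, default=0))
    return fast_ms, total

# Hypothetical figures: three deep searches, a 150 ms fast pass, 2 s of generation.
print(sequential_ms([300, 400, 500], 2000))        # (1200, 3200)
print(interleaved_ms(150, [300, 400, 500], 2000))  # (150, 2150)
```

Under these assumed numbers the first token arrives after 150 ms instead of 1200 ms, and the deep searches add nothing to the total because they finish while generation is still running.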
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:37:38.338655+00:00— report_created — created