Report #68208
[synthesis] Sequential retrieval-then-generation in RAG pipelines creates unnecessary latency and loses source attribution fidelity
Decompose user queries into sub-queries, execute parallel searches across multiple indices, then synthesize with a model prompted to maintain citation links back to source passages. Include passage-level metadata (URL, title, snippet text) in retrieval results so the synthesis model can attribute accurately. Require inline citations in the synthesis prompt.
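A minimal sketch of the decompose-then-search-in-parallel step, using only the Python standard library. Everything named here is an assumption made for illustration: `decompose` is a naive placeholder for what would normally be an LLM call, `search_index` is a stub standing in for a real search backend, and the `example.com` URLs are fake. The point is only that each passage carries URL, title, and snippet, and that the searches run concurrently rather than one after another.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """Passage-level metadata carried through to the synthesis step."""
    url: str
    title: str
    snippet: str


def decompose(query: str) -> list[str]:
    # Placeholder (assumption): a real pipeline would likely ask an LLM
    # for sub-queries. Here we simply split on " and ".
    parts = [part.strip() for part in query.split(" and ")]
    return [part for part in parts if part]


async def search_index(index: str, sub_query: str) -> list[Passage]:
    # Hypothetical backend call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    slug = sub_query.replace(" ", "+")
    return [
        Passage(
            url=f"https://example.com/{index}?q={slug}",
            title=f"{index} result for {sub_query!r}",
            snippet=f"Stub text about {sub_query}.",
        )
    ]


async def retrieve_parallel(query: str, indices: list[str]) -> list[Passage]:
    # One search per (sub-query, index) pair, all awaited concurrently.
    tasks = [
        search_index(index, sub_query)
        for sub_query in decompose(query)
        for index in indices
    ]
    batches = await asyncio.gather(*tasks)

    # Flatten and de-duplicate by URL, preserving first-seen order.
    seen: set[str] = set()
    passages: list[Passage] = []
    for batch in batches:
        for passage in batch:
            if passage.url not in seen:
                seen.add(passage.url)
                passages.append(passage)
    return passages
```

Calling `asyncio.run(retrieve_parallel("rag latency and citation accuracy", ["web", "news"]))` issues four stub searches at once. With real backends, total retrieval time would be bounded by the slowest search rather than the sum of all of them.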
Journey Context:
Standard RAG embeds the query, searches a single index, then generates a response: a sequential, single-source pipeline. Perplexity's observable API behavior points to a more sophisticated architecture. Its response latency profile and citation structure suggest that it decomposes queries into multiple sub-queries, runs parallel searches across multiple indices (web, academic, news, YouTube), then synthesizes with citation awareness. The key architectural insight comes from cross-referencing the API response format (which includes structured citations mapping to specific passages with titles and URLs) with the latency profile (which shows parallel rather than sequential retrieval signatures): the synthesis model must be explicitly prompted to maintain citation links, and the retrieval results must include enough metadata for the model to attribute correctly. Without structured citation metadata in the retrieval results, you get hallucinated or mismatched citations. LangChain's MultiQueryRetriever implements a closely related step: it uses an LLM to generate several variants of the user query and takes the union of the retrieved documents. The tradeoff: parallel retrieval is more complex to implement and requires managing multiple search backends, but it can cut latency by 2-3x and produces significantly better attribution. Products that skip query decomposition miss relevant results that match only a sub-question.
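The citation-aware synthesis step might look like the following sketch. It is an illustration under stated assumptions, not Perplexity's actual prompt or response format: the `Passage` type, the `[n]` citation convention, the prompt wording, and the `invalid_citations` check are all hypothetical. The idea is to number each passage together with its metadata in the prompt, require inline citations, and then verify that every citation in the answer maps to a passage that was actually supplied.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    url: str
    title: str
    snippet: str


def build_synthesis_prompt(question: str, passages: list[Passage]) -> str:
    # Number each source so the model can cite it as [1], [2], ...
    sources = "\n".join(
        f"[{i}] {p.title} ({p.url})\n{p.snippet}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered sources below.\n"
        "Cite every factual claim inline with its source number, e.g. [2].\n"
        "If the sources do not support a claim, say so instead of guessing.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )


def invalid_citations(answer: str, passages: list[Passage]) -> list[int]:
    # Return cited source numbers that do not correspond to any supplied
    # passage (a cheap check for hallucinated citations).
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= len(passages))
```

A non-empty result from `invalid_citations` signals that the model cited a source it was never given, so the answer can be regenerated or the stray citation dropped. This check only catches out-of-range numbers; it does not confirm that a cited passage actually supports the claim attached to it.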
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:58:31.429664+00:00— report_created — created