Report #86007
[synthesis] How does Perplexity's retrieval chain actually work — why does simple RAG fail for AI search?
Implement retrieval as: query decomposition into 2-5 sub-queries → parallel execution across multiple search backends → cross-encoder reranking → synthesis with strict citation indexing. Never use single-query embedding search for complex information needs.
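The four stages above can be sketched as follows. This is a minimal illustration using only the standard library, not Perplexity's implementation. Every name in it (`Hit`, `decompose`, `overlap_score`, `retrieve`, `build_context`) is hypothetical. The decomposition is a naive string split standing in for an LLM call, and the scorer is a token-overlap stand-in for a real cross-encoder.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Hit:
    url: str
    text: str


# Hypothetical interfaces: a backend maps a query to hits,
# and a scorer plays the role of a cross-encoder (query, passage) -> relevance.
Backend = Callable[[str], Awaitable[list[Hit]]]
Scorer = Callable[[str, str], float]


def decompose(query: str) -> list[str]:
    # Placeholder for LLM-driven decomposition: split on " and ", cap at 5 sub-queries.
    parts = [p.strip() for p in query.replace("?", "").split(" and ") if p.strip()]
    return parts[:5] or [query]


def overlap_score(query: str, passage: str) -> float:
    # Stand-in for a cross-encoder: fraction of query tokens found in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


async def retrieve(
    query: str,
    backends: list[Backend],
    scorer: Scorer = overlap_score,
    top_k: int = 3,
) -> list[Hit]:
    # Fan out every sub-query to every backend concurrently.
    tasks = [backend(sq) for sq in decompose(query) for backend in backends]
    batches = await asyncio.gather(*tasks)

    # Deduplicate by URL, keeping the first occurrence.
    unique: dict[str, Hit] = {}
    for batch in batches:
        for hit in batch:
            unique.setdefault(hit.url, hit)

    # Rerank against the original query and keep only the best passages.
    ranked = sorted(unique.values(), key=lambda h: scorer(query, h.text), reverse=True)
    return ranked[:top_k]


def build_context(hits: list[Hit]) -> str:
    # Strict citation indexing: each passage gets a stable [n] the answer must cite.
    return "\n".join(f"[{i}] {h.text} ({h.url})" for i, h in enumerate(hits, start=1))
```

In a real system the `scorer` argument is where a cross-encoder model would be plugged in, and the string from `build_context` would be placed in the synthesis prompt together with an instruction to cite only the listed indices.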
Journey Context:
Simple RAG (embed query → vector search → stuff into context) fails because a single embedding cannot capture a multi-intent query. Perplexity's API traces reveal multiple search calls issued per user query, and their streaming output shows citations arriving in batches, both of which point to parallel retrieval paths. Perplexity cofounders have publicly discussed their multi-hop approach. The critical missing piece in most RAG implementations is the reranking step: raw search results from any single backend have low precision for synthesis, and without a cross-encoder reranker the generation model hallucinates to fill the gaps. The tradeoff is latency: parallel retrieval + reranking adds 200-500ms. That cost is preferable to the alternative, which is generating confident-sounding hallucinations from low-precision retrieval. Another non-obvious detail: the sub-queries are generated by the same model that does synthesis, creating a feedback loop in which the model learns what information it needs.
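One way to keep the latency cost bounded is to give every backend the same time budget and drop any that miss it, so wall-clock time tracks the slowest backend that finishes in time rather than the sum of all calls. The sketch below assumes backends are async callables. `search_with_budget` and the default `budget_s` value are illustrative choices, not something taken from Perplexity.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical backend interface: query in, list of result strings out.
Backend = Callable[[str], Awaitable[list[str]]]


async def search_with_budget(
    backends: dict[str, Backend],
    query: str,
    budget_s: float = 0.5,  # assumed budget, chosen to match the upper end of 200-500ms
) -> dict[str, list[str]]:
    async def guarded(name: str, fn: Backend) -> tuple[str, list[str]]:
        try:
            # Cancel the backend call if it exceeds the budget.
            return name, await asyncio.wait_for(fn(query), timeout=budget_s)
        except asyncio.TimeoutError:
            # A slow backend contributes nothing instead of stalling the answer.
            return name, []

    # All backends run concurrently, so total time is bounded by budget_s.
    pairs = await asyncio.gather(*(guarded(n, f) for n, f in backends.items()))
    return dict(pairs)
```

Returning an empty list for a timed-out backend keeps the downstream reranker simple, at the cost of silently losing that backend's results for the query.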
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:57:09.285358+00:00— report_created — created