Report #99456
[synthesis] How do you build a web-scale RAG system that doesn't hallucinate citations?
Run every query through query parsing/routing → hybrid retrieval (BM25 + dense) → multi-layer ML reranker with a strict quality threshold → constrained LLM synthesis with pre-embedded citation markers → inline citations, and re-query rather than serve weak citations when too few candidates pass the gate.
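The "pre-embedded citation markers" step can be illustrated with a minimal sketch. It assumes the simplest possible scheme: each passage that survives reranking is numbered before it reaches the LLM, and the answer is checked afterwards so that any marker not matching a supplied passage is flagged. The function names `build_prompt` and `invalid_citations` and the prompt wording are hypothetical, not taken from any production system.

```python
import re


def build_prompt(question, passages):
    """Number each passage so the model can only cite markers that exist."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below. "
        "Cite each claim with its marker, e.g. [1].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n"
    )


def invalid_citations(answer, num_passages):
    """Return cited markers that do not correspond to any supplied passage."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_passages)
```

If `invalid_citations` returns a non-empty list, the answer cites something the model was never given, so one option is to reject or regenerate it rather than serve it. This check only catches markers that point nowhere; it does not verify that a cited passage actually supports the claim.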
Journey Context:
Most tutorials stop at "retrieve chunks, stuff them into a prompt." Perplexity's production pipeline has six discrete filtering stages; being retrieved is not the same as being cited. The key signal is the multi-layer reranker (including an XGBoost stage) with a ~0.7 quality threshold and a fail-safe that discards weak result sets. They also built proprietary embeddings (pplx-embed) with INT8 quantization to control relevance at the bottom of the stack. The architecture is retrieval-first, not an LLM with search bolted on, and the citation requirement forces extractability and authority checks that generic RAG skips.
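The threshold-plus-fail-safe behaviour can be sketched roughly as follows. Only the ~0.7 threshold comes from the description above; `min_passing`, `max_attempts`, and the `search_and_rerank` and `rewrite` callables are hypothetical stand-ins for the real retrieval, reranking, and query-rewriting components.

```python
from typing import Callable, List, Optional, Tuple

Scored = Tuple[str, float]  # (passage, reranker score assumed to lie in [0, 1])


def gate(scored: List[Scored], threshold: float = 0.7,
         min_passing: int = 2) -> Optional[List[Scored]]:
    """Keep passages at or above the threshold; discard the whole set if too few pass."""
    passing = [(p, s) for p, s in scored if s >= threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing if len(passing) >= min_passing else None


def retrieve_with_requery(query: str,
                          search_and_rerank: Callable[[str], List[Scored]],
                          rewrite: Callable[[str], str],
                          max_attempts: int = 3) -> Optional[List[Scored]]:
    """Re-query with a rewritten query instead of serving a weak result set."""
    for _ in range(max_attempts):
        result = gate(search_and_rerank(query))
        if result is not None:
            return result
        query = rewrite(query)  # assumption: some reformulation step exists
    return None  # caller should decline to cite rather than cite weakly
```

Returning `None` rather than the best available passages is the point of the design: a weak result set is treated as a failure to handle upstream, not as material to cite.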
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:10:18.323415+00:00 - report_created - created