Report #74043
[synthesis] How to build a fast, grounded RAG system with low latency and high citation accuracy
Decompose the query into sub-queries and execute the web searches in parallel. Inject the retrieved snippets directly into the generation context as numbered references, and instruct the LLM to emit inline citations matching those references as it streams, rather than retrieving first and generating second.
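A minimal sketch of the retrieval and prompt-assembly side, assuming an async search client. The names `fake_search`, `decompose`, `retrieve`, and `build_prompt` are hypothetical and not part of any real library; `fake_search` stands in for a real provider call and returns canned snippets so the example runs offline.

```python
import asyncio

# Hypothetical stand-in for a real search provider call. It returns canned
# snippets after a short delay so the sketch runs without network access.
async def fake_search(sub_query: str) -> list[str]:
    await asyncio.sleep(0.05)
    return [f"Snippet about {sub_query} (a)", f"Snippet about {sub_query} (b)"]

def decompose(query: str) -> list[str]:
    # Naive decomposition, assumed for illustration only: split on " and ".
    return [part.strip() for part in query.split(" and ") if part.strip()]

async def retrieve(query: str) -> list[str]:
    sub_queries = decompose(query)
    # Launch all searches at once; total latency is roughly that of the
    # slowest search rather than the sum of all of them.
    results = await asyncio.gather(*(fake_search(q) for q in sub_queries))
    # Flatten and de-duplicate while preserving order.
    seen, snippets = set(), []
    for batch in results:
        for snippet in batch:
            if snippet not in seen:
                seen.add(snippet)
                snippets.append(snippet)
    return snippets

def build_prompt(query: str, snippets: list[str]) -> str:
    # Number the snippets from 1 so the model can cite them as [1], [2], ...
    refs = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Answer using only the numbered references below. "
        "Cite each claim inline as [n].\n\n"
        f"References:\n{refs}\n\nQuestion: {query}\nAnswer:"
    )

snippets = asyncio.run(retrieve("vector databases and reranking"))
prompt = build_prompt("vector databases and reranking", snippets)
```

In a real system the decomposition step would likely be an LLM call or a rule set tuned to the domain, and `prompt` would be sent to a streaming completion endpoint.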
Journey Context:
Traditional RAG retrieves documents, embeds them, places them in the prompt, and only then generates. Each stage waits on the previous one, so the pipeline is slow. Perplexity's observable API behavior and response speed suggest that it parallelizes the search queries (using multiple search providers) and streams the LLM output while the context is still being populated or immediately afterward. A specialized citation-instruction prompt maps the numbered snippets [1], [2] to the generated text. Keeping the context to short numbered snippets is meant to avoid the 'lost in the middle' problem of large context dumps and to enforce strict grounding.
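Prompting alone does not guarantee that every streamed citation points at a real snippet, so a validation pass on the output stream can help. The sketch below is one possible approach, not a description of Perplexity's implementation: `check_stream` is a hypothetical helper that scans streamed chunks for `[n]` markers and reports any that fall outside the numbered reference list, holding back a marker that is split across two chunks.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
PARTIAL = re.compile(r"\[\d*$")  # an unfinished marker at the end, e.g. "[" or "[1"

def check_stream(chunks, num_refs):
    """Yield (text, invalid_ids) for each piece of a streamed answer.

    A marker such as "[12]" may be split across chunks, so any trailing
    unfinished "[digits" is held back until the next chunk arrives.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = PARTIAL.search(buffer)
        if match:
            ready, buffer = buffer[:match.start()], buffer[match.start():]
        else:
            ready, buffer = buffer, ""
        # A citation is invalid if it does not refer to one of the snippets.
        invalid = [int(n) for n in CITATION.findall(ready)
                   if not 1 <= int(n) <= num_refs]
        if ready:
            yield ready, invalid
    if buffer:
        # The stream ended on an unfinished marker; pass it through as text.
        yield buffer, []

# Simulated stream with markers split across chunk boundaries.
chunks = ["RAG cuts latency [", "1] and improves grounding [", "7]."]
out = list(check_stream(chunks, num_refs=2))
```

Here the citation `[7]` is flagged because only two references were supplied. What to do with a flagged marker (strip it, retry, or log it) is a design choice left to the caller.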
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:52:37.187094+00:00— report_created — created