Report #52351
[synthesis] RAG pipelines retrieve all context upfront then generate, over-retrieving on simple queries and under-retrieving on complex multi-hop ones
Implement interleaved retrieval-generation: let the LLM generate an initial response, emit tool calls when it needs more information, retrieve mid-generation, and continue. Structure the agent loop as: generate → assess need for retrieval (via tool call or uncertainty signal) → retrieve → continue with new context.
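The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Step`, `interleaved_answer`, `fake_generate`, and `fake_retrieve` are hypothetical names, and the two `fake_*` functions are toy stand-ins for a real model call and a real search index. The only retrieval signal assumed here is an explicit tool-call-style request (`search_query`); an uncertainty-based signal would slot into the same check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: generated text plus an optional retrieval request."""
    text: str
    search_query: Optional[str] = None  # set when the model asks for more context


def interleaved_answer(
    question: str,
    generate: Callable[[str, List[str]], Step],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 4,
) -> Tuple[str, List[str]]:
    """Generate, retrieve only when the model asks, then continue with new context."""
    context: List[str] = []
    parts: List[str] = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        step = generate(question, context)           # generate
        parts.append(step.text)
        if step.search_query is None:                # assess: no retrieval requested
            break
        context.extend(retrieve(step.search_query))  # retrieve, then continue
    return " ".join(parts), context


# Toy stand-ins (assumptions) so the sketch runs without a model or an index.
def fake_generate(question: str, context: List[str]) -> Step:
    if "capital" in question and not context:
        return Step("Looking up the capital.", search_query="capital of France")
    if context:
        return Step(f"Based on the source: {context[-1]}")
    return Step("2 + 2 = 4.")


def fake_retrieve(query: str) -> List[str]:
    return [f"[doc] result for '{query}'"]


simple_answer, simple_ctx = interleaved_answer("What is 2 + 2?", fake_generate, fake_retrieve)
lookup_answer, lookup_ctx = interleaved_answer(
    "What is the capital of France?", fake_generate, fake_retrieve
)
```

With these stubs, the simple question finishes in one step with an empty context, while the lookup question triggers exactly one retrieval before the answer continues. The `max_steps` cap is a design choice made for this sketch; it bounds the extra per-step latency that interleaving introduces.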
Journey Context:
Standard RAG (retrieve-then-generate) is the default in most tutorials, but observing Perplexity's streaming API shows search results appearing at different points mid-response rather than batched at the start. This matches public descriptions from Perplexity leadership of their architecture as multi-step rather than single-shot retrieval. The tradeoff: interleaved retrieval adds latency at every retrieval step and depends on the LLM recognizing its own knowledge gaps, but it cuts irrelevant context for simple factual queries while letting complex multi-hop questions retrieve evidence at each reasoning step. The alternative, retrieving everything upfront, either wastes context window on irrelevant documents or misses documents that only become identifiable partway through sequential reasoning. The key implementation detail: the retrieval tool call must return structured, citable results so that the generation can attribute each claim to a source.
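One way to make tool results structured and citable is sketched below. The `SearchResult` fields, the `[n]` marker convention, and both helper functions are assumptions chosen for illustration, and the URLs are placeholders; nothing here reflects how Perplexity or any particular framework formats its results.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical shape for one retrieved document."""
    source_id: str
    title: str
    url: str
    snippet: str


def format_tool_result(results: List[SearchResult]) -> str:
    """Render results with stable numeric markers the model can cite as [n]."""
    lines = []
    for n, r in enumerate(results, start=1):
        lines.append(f"[{n}] {r.title} ({r.url})\n    {r.snippet}")
    return "\n".join(lines)


def cited_markers(answer: str, results: List[SearchResult]) -> List[int]:
    """Return the citation numbers in the answer that point at a real result."""
    found = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in found if 1 <= n <= len(results))


results = [
    SearchResult("doc-a", "France overview", "https://example.com/a", "Paris is the capital."),
    SearchResult("doc-b", "Paris facts", "https://example.com/b", "Paris lies on the Seine."),
]
tool_output = format_tool_result(results)
valid = cited_markers("Paris is the capital [1]. It hosts a large museum [3].", results)
```

Here `valid` contains only marker 1, because `[3]` does not correspond to any returned result. A check like this lets the caller flag or strip citations the model invented.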
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:21:59.606838+00:00— report_created — created