Report #81700
[synthesis] How much engineering effort should I spend on retrieval vs generation in my RAG pipeline?
Spend 3-4x more engineering effort on retrieval than on generation. Invest in query decomposition, multi-source parallel retrieval, reranking, and deduplication. On factual tasks, a mediocre model with excellent retrieval will usually outperform a great model with poor retrieval.
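As a rough illustration of the four stages named above (decomposition, parallel multi-source retrieval, deduplication, reranking), here is a minimal sketch. Everything in it is hypothetical: `Doc`, `Source`, `decompose`, `dedupe`, `overlap_score`, and `retrieve` are names made up for this example, the decomposition is a naive string split standing in for an LLM call, and the term-overlap score is a placeholder for a real reranker such as a cross-encoder.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


# Assumption: a source is any async function mapping a query to documents
# (a vector index, a keyword index, a web search wrapper, ...).
Source = Callable[[str], Awaitable[list[Doc]]]


def decompose(query: str) -> list[str]:
    """Toy decomposition: split on ' and '. A real system would likely ask an LLM."""
    parts = [p.strip(" ?") for p in query.split(" and ")]
    return [p for p in parts if p]


def dedupe(docs: Iterable[Doc]) -> list[Doc]:
    """Keep the first occurrence of each doc_id, preserving order."""
    seen: set[str] = set()
    unique: list[Doc] = []
    for doc in docs:
        if doc.doc_id not in seen:
            seen.add(doc.doc_id)
            unique.append(doc)
    return unique


def overlap_score(query: str, doc: Doc) -> float:
    """Placeholder relevance score: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


async def retrieve(query: str, sources: list[Source], top_k: int = 3) -> list[Doc]:
    sub_queries = decompose(query)
    # Fan out: every sub-query goes to every source concurrently.
    tasks = [source(sq) for sq in sub_queries for source in sources]
    batches = await asyncio.gather(*tasks)
    candidates = dedupe(doc for batch in batches for doc in batch)
    # Rerank the merged candidates against the original query.
    ranked = sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:top_k]
```

Only the output of `retrieve` would be passed to the generation step. In this shape, one user question triggers `len(sub_queries) * len(sources)` retrieval calls for a single generation call, which is where the extra engineering effort tends to go.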
Journey Context:
The default approach to RAG is: embed the query, run a vector search, stuff the top-k results into the prompt, and generate. This works for demos but fails in production.

Judging from its API behavior and UI, Perplexity appears to decompose each user query into 3-5 sub-queries, execute parallel searches across multiple sources, deduplicate and rerank the results, and only then synthesize. Cursor's agent mode similarly reads multiple files and performs multiple searches before generating a suggestion. The ratio of retrieval operations to generation operations in production systems is roughly 3:1 to 5:1.

The reason: LLMs are already good at synthesis when given the right context. The bottleneck is almost always getting the right context into the prompt. A GPT-3.5-class model with perfect retrieval beats GPT-4 with mediocre retrieval on factual tasks. The common mistake is over-investing in the generation model (chasing the latest release) while under-investing in retrieval (using basic vector search with no query transformation).

The high-leverage improvements are: query rewriting (reformulating the user question for better retrieval), query decomposition (breaking complex questions into sub-questions), hybrid search (combining vector and keyword search), and reranking (using a cross-encoder to re-score results). Each of these is typically worth more than a model upgrade.
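Hybrid search needs some way to merge the vector ranking and the keyword ranking into one list. The report does not say which method to use; one common option is reciprocal rank fusion (RRF), sketched below under that assumption. The function name and the document ids are illustrative, and `k=60` is a widely used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so documents ranked well by several retrievers
    rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector index and a keyword index.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only rank positions, so the vector and keyword scores never have to be put on a common scale. The fused list can then be handed to a cross-encoder reranker for the final re-scoring pass.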
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:44:02.541597+00:00— report_created — created