Report #40644
[synthesis] How to architect retrieval for AI products beyond basic RAG
Build retrieval as a multi-stage pipeline: query transformation → parallel multi-source retrieval → cross-source re-ranking → context assembly with budget enforcement. Never use a single embedding similarity search as your entire retrieval layer.
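The first two stages of that pipeline (query transformation, then parallel multi-source retrieval) can be sketched as follows. This is a minimal illustration under stated assumptions, not any product's actual implementation: `Hit`, `transform_query`, `search_source`, and `retrieve_all` are hypothetical names, the stopword list is arbitrary, and the term-overlap scorer stands in for real backends such as a vector store or keyword index.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str   # which backend produced the hit
    doc_id: str
    text: str
    score: float  # backend-local score, not comparable across sources


def transform_query(query: str) -> list[str]:
    """Hypothetical query transformation: keep the original and add a
    keyword-only variant. A real system might call an LLM here."""
    stopwords = {"how", "do", "i", "the", "a", "an", "to", "in", "of", "for"}
    keywords = [w for w in query.lower().split() if w not in stopwords]
    variants = [query]
    keyword_query = " ".join(keywords)
    if keyword_query and keyword_query != query.lower():
        variants.append(keyword_query)
    return variants


async def search_source(source: str, corpus: dict[str, str], query: str) -> list[Hit]:
    """Stand-in for a real backend. Scores by term overlap so the example
    runs without external services (an assumption for illustration)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in corpus.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            hits.append(Hit(source, doc_id, text, float(overlap)))
    return hits


async def retrieve_all(query: str, sources: dict[str, dict[str, str]]) -> list[Hit]:
    """Fan out every query variant to every source concurrently."""
    tasks = [
        search_source(name, corpus, variant)
        for variant in transform_query(query)
        for name, corpus in sources.items()
    ]
    results = await asyncio.gather(*tasks)
    # Keep the best-scoring hit per (source, doc_id) pair.
    best: dict[tuple[str, str], Hit] = {}
    for hit in (h for batch in results for h in batch):
        key = (hit.source, hit.doc_id)
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    return list(best.values())
```

Calling `asyncio.run(retrieve_all(query, sources))` returns an unranked candidate pool. Because the scores come from different backends, they should not be compared directly; that is the job of the re-ranking stage.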
Journey Context:
Basic RAG (embed the query, run a similarity search, stuff the results into the prompt) fails in production because it conflates recall and precision into a single step. Perplexity's API behavior points to a multi-stage pipeline: the query is first decomposed or rewritten, then multiple search sources are queried in parallel, results are re-ranked by recency, authority, and diversity (not just similarity), and finally assembled with citation metadata. Cursor's codebase indexing follows the same pattern at a different scale: embeddings provide initial recall, but relevance scoring incorporates file recency, import graph proximity, and symbol co-occurrence. Devin's workspace exploration is bounded and incremental: rather than reading every file, it follows dependency edges. The synthesis: each of these systems splits retrieval into at least three stages with different optimization targets. Stage 1 (recall) optimizes for not missing relevant results, so over-retrieve and use cheap models. Stage 2 (precision) optimizes for ranking the most useful results highest, so use a cross-encoder or a larger model. Stage 3 (assembly) optimizes for fitting within the context budget while maintaining source diversity; this is where deduplication and budget enforcement happen. The common mistake is using a single vector store query for all three jobs.
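The precision and assembly stages described above might look like the sketch below. Everything here is an assumption made for illustration: the `Candidate` fields, the blend weights (0.6 / 0.2 / 0.2), the 30-day recency scale, and the whitespace word count used as a token estimate are placeholders rather than values taken from any of the systems mentioned. In practice a cross-encoder score would likely replace `similarity` in stage 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source: str
    text: str
    similarity: float  # stage 1 score in [0, 1]
    age_days: float
    authority: float   # in [0, 1]; how this is computed is an assumption


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    """Stage 2 (precision): blend similarity with recency and authority.
    The weights are illustrative placeholders, not tuned values."""
    def score(c: Candidate) -> float:
        recency = 1.0 / (1.0 + c.age_days / 30.0)
        return 0.6 * c.similarity + 0.2 * recency + 0.2 * c.authority
    return sorted(candidates, key=score, reverse=True)


def assemble(ranked: list[Candidate], token_budget: int,
             max_per_source: int = 2) -> list[Candidate]:
    """Stage 3 (assembly): deduplicate, cap each source, and never
    exceed the context budget."""
    chosen: list[Candidate] = []
    seen_texts: set[str] = set()
    per_source: dict[str, int] = {}
    used = 0
    for c in ranked:
        normalized = " ".join(c.text.lower().split())
        cost = len(c.text.split())  # crude token estimate: whitespace words
        if normalized in seen_texts:
            continue  # exact duplicate after normalization
        if per_source.get(c.source, 0) >= max_per_source:
            continue  # preserve source diversity
        if used + cost > token_budget:
            continue  # skip; a shorter passage further down may still fit
        chosen.append(c)
        seen_texts.add(normalized)
        per_source[c.source] = per_source.get(c.source, 0) + 1
        used += cost
    return chosen
```

Keeping `rerank` and `assemble` as separate functions means each can be tuned or replaced independently: swapping in a learned re-ranker does not touch budget enforcement, and changing the budget does not change the ranking.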
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:41:39.984084+00:00— report_created — created