Report #83130
[frontier] Naive RAG pipeline returns irrelevant chunks and the agent hallucinates answers despite having a retrieval system
Replace single-shot RAG (embed query → retrieve top-K → generate) with agentic RAG: expose search and retrieval as tools the agent can call iteratively. The agent formulates a query, evaluates whether the results are relevant, reformulates if they are not, follows references in retrieved documents, and synthesizes an answer only after accumulating sufficient evidence. Implement a relevance-check step: after each retrieval, the agent explicitly rates the relevance of the results before using them.
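A minimal sketch of the loop described above, under stated assumptions: `agentic_retrieve`, `Chunk`, and the three `toy_*` functions are hypothetical names introduced for illustration, and the toy functions are deterministic stand-ins for what would be a real search tool and LLM calls (relevance rating, query reformulation). It shows the control flow only, not a production implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


def agentic_retrieve(
    question: str,
    search: Callable[[str], List[Chunk]],
    rate: Callable[[str, Chunk], float],
    reformulate: Callable[[str, List[str]], Optional[str]],
    min_relevance: float = 0.5,
    min_evidence: int = 2,
    max_steps: int = 4,
) -> Tuple[List[Chunk], List[str]]:
    """Search, rate, and reformulate until enough relevant chunks are found."""
    evidence: List[Chunk] = []
    tried: List[str] = []
    query: Optional[str] = question
    while query is not None and len(tried) < max_steps:
        tried.append(query)
        for chunk in search(query):
            # Relevance check: rate against the original question,
            # not the rewritten query, and skip chunks already kept.
            if chunk not in evidence and rate(question, chunk) >= min_relevance:
                evidence.append(chunk)
        if len(evidence) >= min_evidence:
            break  # enough evidence to synthesize an answer
        # Ask for a new query; None means there is nothing left to try.
        query = reformulate(question, tried)
    return evidence, tried


# --- Toy stand-ins (assumptions, not a real retrieval stack) ---
CORPUS = [
    Chunk("policy", "Throttling policy: clients are limited to 100 requests per minute."),
    Chunk("errors", "When throttling applies, the server returns HTTP 429."),
    Chunk("billing", "Invoices are issued monthly."),
]


def toy_search(query: str) -> List[Chunk]:
    """Keyword search: return chunks sharing at least one word with the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    return [c for c in CORPUS if terms & set(re.findall(r"\w+", c.text.lower()))]


def toy_rate(question: str, chunk: Chunk) -> float:
    """Stub for an LLM relevance judge; here it only looks for one keyword."""
    return 1.0 if "throttling" in chunk.text.lower() else 0.0


def toy_reformulate(question: str, tried: List[str]) -> Optional[str]:
    """Stub for LLM query rewriting; offers a single alternative phrasing."""
    candidate = "throttling policy"
    return None if candidate in tried else candidate


evidence, tried = agentic_retrieve(
    "API rate limits", toy_search, toy_rate, toy_reformulate
)
print(tried)
print([c.doc_id for c in evidence])
```

In this toy run the first query matches nothing in the corpus, so the loop reformulates once and then stops as soon as two relevant chunks have been collected. The `max_steps` cap is one way to bound the extra token cost and latency; the right value is workload-dependent.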
Journey Context:
Naive RAG fails for well-documented reasons: the query embedding doesn't match document phrasing (vocabulary mismatch), top-K retrieval misses the right chunk (recall gap), and the model generates confidently from insufficient or irrelevant context. The 2024 fix was better embeddings, chunking strategies, and re-ranking. The 2025 fix is fundamentally different: give the agent agency over retrieval. The agent can reformulate queries when results are poor (for example, no results for "API rate limits", so try "throttling policy"), use multiple search strategies (keyword for exact matches, semantic for conceptual matches, structural for code), follow citations and links within retrieved documents, and recognize when it has enough evidence versus when it needs more. The tradeoff: agentic RAG uses more tokens and adds latency (multiple LLM calls per retrieval cycle), but the accuracy improvements can be dramatic for complex queries. Anthropic's agent patterns research supports this: tool-use-based retrieval significantly outperforms single-shot RAG for multi-hop questions. Use a hybrid approach: simple factual lookups can stay single-shot; anything requiring synthesis, comparison, or multi-step reasoning should use agentic retrieval.
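The hybrid approach can be sketched as a small router placed in front of the two pipelines. This is an illustration only: `route`, `SYNTHESIS_CUES`, and the cue words themselves are assumptions, and a real system might instead use a trained classifier or let the model decide.

```python
import re

# Hypothetical cue words suggesting synthesis, comparison, or multi-step reasoning.
SYNTHESIS_CUES = {
    "compare", "versus", "vs", "difference",
    "why", "tradeoff", "summarize", "explain",
}


def route(question: str) -> str:
    """Return "agentic" for complex questions, "single_shot" for simple lookups."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    # More than one question mark is treated as a multi-part question.
    multi_part = question.count("?") > 1
    if words & SYNTHESIS_CUES or multi_part:
        return "agentic"
    return "single_shot"


print(route("What is the default request timeout?"))
print(route("Compare the throttling policy of v1 versus v2"))
```

A keyword heuristic like this will misroute some questions; it is cheap enough to run on every request, and routing errors in the "agentic" direction cost only extra tokens and latency.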
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:07:24.119285+00:00— report_created — created