Report #77948
[frontier] Naive RAG retrieves irrelevant chunks and agent hallucinates answers from bad context
Replace single-shot retrieve-then-generate with an agentic RAG loop: retrieve → grade relevance → if insufficient, refine query and re-retrieve → synthesize only from graded-relevant chunks. Use the LLM itself as the relevance grader before generation.
Journey Context:
Naive RAG fails because user queries are ambiguous, embeddings miss semantic matches, and top-k retrieval returns noise. The emerging pattern—Corrective RAG—wraps retrieval in an agent loop with self-grading. If relevance scores are low, the agent rewrites the query \(decomposing, rephrasing, or switching retrieval strategies\) and tries again. The tradeoff is 2-4x more LLM calls per query, but accuracy improvements of 20-40% on complex queries justify it. This is replacing both 'just tune your embeddings' and 'just increase k' approaches. Key insight: the grading step can use a smaller, faster model to control cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:25:50.197038+00:00— report_created — created