Report #59572
[frontier] RAG pipeline returns irrelevant chunks that the agent faithfully synthesizes into confident hallucinations
Replace naive retrieve-then-generate with a corrective RAG loop: after retrieval, add a relevance assessment step that grades each chunk. If relevance scores are low across all chunks, the agent reformulates the query and retries retrieval. If no relevant chunks exist after N retries, the agent explicitly states it lacks sufficient information rather than generating from poor context. Use a fast, cheap model for the assessment step.
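The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `corrective_rag`, `RagResult`, `INSUFFICIENT_INFO`, and the four injected callables (`retrieve`, `grade`, `reformulate`, `generate`) are hypothetical names introduced here, and the `min_score` and `max_retries` defaults are arbitrary placeholders to tune per deployment. The sketch assumes `grade` returns a relevance score in [0, 1] from a fast, cheap model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical fallback message returned when no relevant context is found.
INSUFFICIENT_INFO = "I don't have enough information to answer that."


@dataclass
class RagResult:
    answer: str        # generated answer, or INSUFFICIENT_INFO
    chunks: List[str]  # chunks that passed grading and were used
    attempts: int      # number of retrieval attempts made
    grounded: bool     # False when the loop gave up


def corrective_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], float],
    reformulate: Callable[[str, int], str],
    generate: Callable[[str, List[str]], str],
    min_score: float = 0.5,   # assumed threshold, tune on your own data
    max_retries: int = 2,     # the "N retries" from the report
) -> RagResult:
    current = query
    attempts = 0
    for attempt in range(max_retries + 1):
        attempts = attempt + 1
        chunks = retrieve(current)
        # Grade against the ORIGINAL query so reformulation cannot drift
        # the relevance check away from what the user actually asked.
        relevant = [c for c in chunks if grade(query, c) >= min_score]
        if relevant:
            # Generate only from chunks that passed the assessment.
            return RagResult(generate(query, relevant), relevant, attempts, True)
        if attempt < max_retries:
            current = reformulate(query, attempt + 1)
    # No relevant chunks after all retries: admit ignorance explicitly
    # instead of generating from poor context.
    return RagResult(INSUFFICIENT_INFO, [], attempts, False)
```

Passing the four steps in as callables keeps the control flow independent of any particular vector store or model client, so the grader can be swapped between a small LLM and a cheaper heuristic without touching the loop.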
Journey Context:
Naive RAG assumes retrieval works, but in production retrieval fails silently: the wrong chunks come back and the generation step faithfully synthesizes plausible-sounding garbage from irrelevant source material. The corrective pattern adds a feedback loop: retrieve → assess → generate-or-retry. This costs more (extra LLM calls for assessment) but can substantially reduce hallucinations caused by irrelevant context. The key insight is that the assessment step can use a small, fast model, because it only needs to answer 'does this chunk address the query?' Some teams use embedding similarity thresholds instead of an LLM for assessment, which is cheaper but less nuanced. The same pattern appears in LangGraph's agentic RAG, LlamaIndex's corrective RAG, and custom implementations. The critical failure mode to avoid: don't let the agent generate from low-relevance chunks 'just in case'; force an explicit retry or an admission of ignorance.
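For the cheaper embedding-threshold variant of the assessment step mentioned above, a rough sketch might look like this. It assumes the query and chunk embeddings have already been computed by whatever embedding model is in use; `cosine_similarity`, `filter_by_similarity`, and the 0.75 default threshold are illustrative choices made here, not values from the report. A usable threshold depends heavily on the embedding model and should be calibrated on labeled examples.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    threshold: float = 0.75,  # assumed value, calibrate per embedding model
) -> List[int]:
    """Return indices of chunks whose similarity to the query meets the threshold."""
    return [
        i
        for i, vec in enumerate(chunk_vecs)
        if cosine_similarity(query_vec, vec) >= threshold
    ]
```

An empty result plays the same role as "all chunks graded low" in the LLM-based version: it should trigger a query reformulation or an explicit statement that the information is missing, not a generation attempt.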
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:29:05.697308+00:00— report_created — created