Report #22953
[frontier] RAG retrieves documents that contradict system instructions or contain poisoned context
Implement Inverse Retrieval: before sending retrieved docs to the LLM, use a lightweight classifier or embedding similarity to identify 'poison' documents (outdated, off-topic, or contradictory) and filter them out. Maintain a 'negative examples' index of explicitly excluded content. Pass only the surviving top-K to the agent.
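A minimal sketch of the embedding-similarity variant, assuming candidates arrive as (doc_id, embedding, relevance_score) tuples and the negative index is a plain list of embeddings. The function name `inverse_retrieval`, the 0.9 threshold, and the toy vectors are illustrative assumptions, not part of the original report; a real system would use its own embedding model and vector store.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def inverse_retrieval(
    candidates: List[Tuple[str, Vector, float]],
    negative_index: List[Vector],
    top_k: int = 3,
    poison_threshold: float = 0.9,  # assumed cutoff; tune on your own data
) -> List[str]:
    """Drop candidates that sit too close to any known-bad embedding,
    then return the ids of the top_k survivors by relevance score."""
    survivors = []
    for doc_id, embedding, score in candidates:
        # Highest similarity to any explicitly excluded document.
        max_poison = max(
            (cosine(embedding, neg) for neg in negative_index), default=0.0
        )
        if max_poison < poison_threshold:
            survivors.append((doc_id, score))
    survivors.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in survivors[:top_k]]


# Toy data: the negative index holds the embedding of an outdated API doc.
negative_index = [[1.0, 0.0, 0.0]]
candidates = [
    ("api_v1_docs", [0.99, 0.05, 0.0], 0.95),  # near the excluded doc
    ("api_v2_docs", [0.1, 0.9, 0.2], 0.90),
    ("changelog", [0.0, 0.3, 0.9], 0.70),
]
kept = inverse_retrieval(candidates, negative_index, top_k=2)
print(kept)
```

Note that the filter runs before the top-K cut, so a high-scoring poisoned document cannot occupy a slot that a valid document would otherwise fill.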
Journey Context:
Standard RAG optimizes for recall@K, but in agent contexts false positives can be catastrophic (e.g., retrieving old API docs that contradict new ones). Anthropic's Contextual Retrieval write-up (2024) points the same way: it adds a reranking step so that only the most relevant chunks reach the model. The technique: maintain an index of 'anti-context' (explicitly bad docs) and use embedding distance to detect retrieved content that resembles it. Alternatively, use a small model (e.g., DistilBERT) to classify retrieved chunks as 'valid domain' vs 'outlier' before the main LLM sees them. Key benefit: fewer wasted tokens and less conflicting context for the model. Common error: raising top-K and assuming the LLM will 'figure it out', which wastes context and increases hallucination risk.
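The classifier alternative can be sketched as a gate that takes any callable returning a 'valid domain' probability. Everything here is an assumption for illustration: `keyword_stub_classifier` is a hypothetical stand-in for a fine-tuned small model such as DistilBERT, and the marker phrases and 0.5 cutoff are made up.

```python
from typing import Callable, List

# Any function mapping a chunk of text to P(chunk is valid, in-domain).
ChunkClassifier = Callable[[str], float]


def keyword_stub_classifier(chunk: str) -> float:
    """Hypothetical stand-in for a trained classifier.
    Flags chunks that mention obviously stale material."""
    stale_markers = ("deprecated", "no longer supported", "legacy endpoint")
    text = chunk.lower()
    return 0.1 if any(marker in text for marker in stale_markers) else 0.9


def gate_chunks(
    chunks: List[str],
    classify: ChunkClassifier,
    min_valid_prob: float = 0.5,  # assumed cutoff
) -> List[str]:
    """Keep only chunks the classifier considers valid, preserving order."""
    return [chunk for chunk in chunks if classify(chunk) >= min_valid_prob]


chunks = [
    "Use POST /v2/orders to create an order.",
    "DEPRECATED: use GET /v1/order/create instead.",
    "Orders are idempotent when an Idempotency-Key header is sent.",
]
valid_chunks = gate_chunks(chunks, keyword_stub_classifier)
print(valid_chunks)
```

Keeping the classifier behind a plain callable means the stub can be swapped for a real model without touching the gating logic.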
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:56:10.040095+00:00— report_created — created