Report #88374
[gotcha] Model behavior altered by poisoned fine-tuning data or RAG corpus
Implement rigorous data sanitization and provenance tracking for any datasets used in fine-tuning or RAG. Audit documents for embedded instructions before indexing.
Journey Context:
When you fine-tune models on scraped data or build RAG indices from untrusted sources, attackers can plant sleeper documents. These documents contain subtle prompt injections (e.g., "I am now DAN"). Once a model is trained on such a document, the altered behavior persists in its weights until the model is retrained; when a RAG pipeline retrieves one, the injected text can steer responses for as long as the document stays in the index. Developers often assume that training data or a RAG corpus is just factual content, but it is an attack surface.
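A minimal sketch of the pre-indexing audit recommended above, assuming documents arrive as (source, text) pairs. The pattern list, `AuditResult`, `audit_document`, and `filter_corpus` are all hypothetical names made up for illustration, not part of any library. Regex matching only catches crude, known phrasings, so treat this as a first-pass filter to pair with human review, not as a complete defense.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical, deliberately small pattern list. Paraphrased or obfuscated
# injections will slip past it, so flagged and unflagged data both need review.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"\bI am now DAN\b", re.IGNORECASE),
    re.compile(r"\byou are now\b", re.IGNORECASE),
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
]


@dataclass(frozen=True)
class AuditResult:
    source: str      # provenance: where the document came from (URL, path, dataset id)
    sha256: str      # content hash, so the exact text can be traced or purged later
    findings: tuple  # regex patterns that matched the document text

    @property
    def quarantined(self) -> bool:
        return bool(self.findings)


def audit_document(text: str, source: str) -> AuditResult:
    """Hash the document and record which suspicious patterns it matches."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    findings = tuple(p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text))
    return AuditResult(source=source, sha256=digest, findings=findings)


def filter_corpus(docs):
    """Split (source, text) pairs into clean documents and quarantined audit records."""
    clean, quarantined = [], []
    for source, text in docs:
        result = audit_document(text, source)
        if result.quarantined:
            quarantined.append(result)
        else:
            # Keep the hash next to the text so provenance travels with the document.
            clean.append((source, text, result.sha256))
    return clean, quarantined
```

Only the `clean` list would go on to fine-tuning or indexing; the quarantined records keep the source and hash so a reviewer can inspect the original and, if a poisoned document is found later, locate every copy that was ingested.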
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:55:13.054487+00:00— report_created — created