Report #82919
[gotcha] Scraping web data for RAG without verifying provenance, allowing SEO-poisoned pages to enter the vector database
Implement strict data provenance and curation for RAG indices. Scan ingested documents for instruction-like patterns before embedding. Prefer authoritative sources over open web scraping.
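A minimal sketch of the provenance check, assuming an allowlist of trusted hosts is an acceptable policy for your index. `TRUSTED_HOSTS`, `Provenance`, and `check_provenance` are hypothetical names introduced for illustration, and the listed hosts are placeholders, not recommendations from the report.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the sources you actually trust.
TRUSTED_HOSTS = frozenset({"docs.python.org", "www.rfc-editor.org"})


@dataclass(frozen=True)
class Provenance:
    url: str
    host: str
    trusted: bool


def check_provenance(url: str, trusted_hosts=TRUSTED_HOSTS) -> Provenance:
    """Record where a document came from and whether that host is allowlisted."""
    parts = urlsplit(url)
    # Compare the exact hostname, not a substring, so lookalike domains such as
    # "docs.python.org.evil.example" do not pass.
    host = (parts.hostname or "").lower()
    trusted = parts.scheme == "https" and host in trusted_hosts
    return Provenance(url=url, host=host, trusted=trusted)
```

Storing the resulting `Provenance` record alongside each chunk makes it possible to purge everything from a source later if it turns out to be compromised.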
Journey Context:
If your RAG system scrapes the web, attackers can optimize malicious pages to rank for the queries your system makes (e.g., 'What is the latest news on X?'). The system ingests the malicious page, and when a user asks about X, the poisoned chunk is retrieved and inserted into the model's context, where the model may follow the instructions embedded in it. Your retrieval pipeline becomes the attack vector.
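The scan for instruction-like patterns could look roughly like the sketch below. The patterns and the names `SUSPICIOUS_PATTERNS`, `find_instruction_like`, and `filter_for_embedding` are illustrative assumptions. Regex matching is a heuristic that a determined attacker can evade, so treat it as one layer on top of source curation, not a replacement for it.

```python
import re

# Illustrative patterns only; a real deployment would need a broader,
# maintained set and probably a classifier on top.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"disregard\s+(the\s+)?(system|previous)\s+prompt", re.I),
    re.compile(r"you\s+are\s+now\s+an?\s+", re.I),
]


def find_instruction_like(text: str) -> list[str]:
    """Return the first match of each suspicious pattern found in the text."""
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits


def filter_for_embedding(docs: list[str]) -> tuple[list[str], list[str]]:
    """Split documents into (accepted, quarantined) before they are embedded."""
    accepted, quarantined = [], []
    for doc in docs:
        if find_instruction_like(doc):
            quarantined.append(doc)  # hold for human review, do not embed
        else:
            accepted.append(doc)
    return accepted, quarantined
```

Quarantining flagged documents for review, instead of silently dropping them, keeps a record of attempted poisoning and lets you tune the patterns against false positives.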
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:46:19.744008+00:00— report_created — created