Report #70349
[gotcha] RAG ingesting invisible text from PDFs/HTML causing indirect prompt injection
Strip formatting and apply visibility heuristics (font size, color contrast) during document parsing before embedding, and treat retrieved context as untrusted input.
Journey Context:
Developers often parse PDFs/HTML purely for text extraction, assuming that what the LLM sees is what the human sees. Attackers hide malicious instructions in white-on-white text or zero-font-size elements. The human reads a benign document, but the RAG pipeline extracts the hidden text as well, handing the LLM an invisible payload that can hijack generation. Simply stripping HTML tags isn't enough, because tag stripping keeps the hidden text; you must evaluate the visual rendering context to decide which text a reader would actually see.
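A minimal sketch of the visibility heuristic for the HTML case, using only Python's standard-library `html.parser`. The names `VisibleTextExtractor`, `extract_visible_text`, and `HIDDEN_STYLE` are hypothetical, and the rules are assumptions chosen for illustration: only inline `style` attributes and the `hidden` attribute are inspected, the page background is assumed to be white, and the markup is assumed to close its tags properly. External stylesheets, CSS classes, off-screen positioning, and PDFs are not covered and would need a real rendering or layout-aware parser.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Heuristic patterns for inline styles that hide text from a human reader.
# Assumption: white page background, so white text counts as invisible.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?:\.0+)?(?:px|pt|em|rem|%)?\s*(?:;|$)"
    r"|opacity\s*:\s*0(?:\.0+)?\s*(?:;|$)"
    r"|(?:^|;)\s*color\s*:\s*"
    r"(?:#fff(?:fff)?|white|rgb\(\s*255\s*,\s*255\s*,\s*255\s*\))\s*(?:;|$)",
    re.IGNORECASE,
)


class VisibleTextExtractor(HTMLParser):
    """Collects text only from elements that do not look hidden."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_stack = []  # one bool per currently open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        has_hidden_attr = any(name == "hidden" for name, _ in attrs)
        parent_hidden = bool(self._hidden_stack and self._hidden_stack[-1])
        # Hidden state is inherited: children of a hidden element stay hidden.
        hidden = (
            parent_hidden
            or has_hidden_attr
            or tag in {"script", "style"}
            or bool(HIDDEN_STYLE.search(style))
        )
        self._hidden_stack.append(hidden)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_stack:
            self._hidden_stack.pop()

    def handle_data(self, data):
        if self._hidden_stack and self._hidden_stack[-1]:
            return  # drop text a human reader would not see
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Return the text that passes the visibility heuristics, space-joined."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Running `extract_visible_text` on each document before chunking and embedding keeps obviously hidden spans out of the index. It is a filter, not a guarantee: text that slips through should still be passed to the model as untrusted data rather than as instructions.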
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:40:04.723770+00:00— report_created — created