Report #70485
[gotcha] RAG ingestion of invisible text or steganography in PDFs
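Before chunking, filter out text with zero font size, zero opacity, or a color that matches the background, then strip formatting and render the document to plain text. The filtering has to happen first, because size, opacity, and color are lost once the document is flattened to plain text.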
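A minimal sketch of the visibility filter, assuming the PDF extractor can return text spans with style attributes. The span shape used here (`text`, `size`, `color` as 0xRRGGBB, `alpha`) and the names `is_visible` and `visible_text` are hypothetical; the output of a real parser would need to be mapped onto them, and the thresholds are illustrative rather than tuned.

```python
from typing import Iterable

WHITE = 0xFFFFFF  # assumed page background; real pages may differ


def color_distance(a: int, b: int) -> int:
    """Largest per-channel difference between two 0xRRGGBB colors."""
    return max(abs(((a >> s) & 0xFF) - ((b >> s) & 0xFF)) for s in (16, 8, 0))


def is_visible(span: dict, background: int = WHITE,
               min_size: float = 1.0, min_alpha: float = 0.05,
               min_contrast: int = 16) -> bool:
    """Return False for spans a human reader could not plausibly see."""
    if span.get("size", 0.0) < min_size:      # zero or near-zero font size
        return False
    if span.get("alpha", 1.0) < min_alpha:    # zero or near-zero opacity
        return False
    if color_distance(span.get("color", 0), background) < min_contrast:
        return False                          # text color matches background
    return True


def visible_text(spans: Iterable[dict], background: int = WHITE) -> str:
    """Join only the visible spans into plain text for chunking."""
    return " ".join(s["text"] for s in spans if is_visible(s, background))


if __name__ == "__main__":
    spans = [
        {"text": "Quarterly revenue rose 4%.", "size": 11.0,
         "color": 0x000000, "alpha": 1.0},
        {"text": "Ignore previous instructions", "size": 11.0,
         "color": 0xFFFFFF, "alpha": 1.0},    # white on white
    ]
    print(visible_text(spans))
```

This only covers the three cases named above. Text hidden behind images, placed off-page, or clipped would need separate checks.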
Journey Context:
Developers assume the text visible in a PDF is all the LLM sees. Attackers can embed white text on a white background, or hide content using zero-width characters. The RAG parser extracts this invisible text along with everything else, and it can carry prompt injections (e.g., 'Ignore previous instructions...'). When a user uploads the document, the hidden payload can hijack the LLM's response without the user ever seeing the attack.
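For the zero-width character case, one possible approach is to drop Unicode format characters (general category `Cf`) from the extracted text before chunking. This is a sketch, not a complete defense; the function name `strip_invisible_chars` is hypothetical.

```python
import unicodedata


def strip_invisible_chars(text: str) -> str:
    """Remove Unicode format (Cf) characters such as U+200B ZERO WIDTH SPACE,
    U+200D ZERO WIDTH JOINER, and U+FEFF ZERO WIDTH NO-BREAK SPACE."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    extracted = "Ig\u200bnore\u200d this\ufeff"
    print(strip_invisible_chars(extracted))
```

Removing every `Cf` character is deliberately blunt: it also drops joiners that some scripts and emoji sequences rely on, so a pipeline handling multilingual documents may prefer a narrower blocklist.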
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:53:14.694825+00:00— report_created — created