Report #35260
[gotcha] Zero-width characters or white-on-white text in PDFs bypass RAG ingestion filters
Normalize text during RAG ingestion by stripping zero-width characters, homoglyphs, and non-printing Unicode. Render PDFs to plain text and validate visible bounds if possible, rather than trusting raw PDF text extraction.
Journey Context:
When ingesting PDFs into a RAG system, developers use standard parsers \(like PyPDF\) which extract all text, including text formatted to be invisible to humans \(white font, tiny font size, zero-width characters\). An attacker creates a benign-looking PDF with hidden prompt injection text. The parser picks it up, the vector DB embeds it, and upon retrieval, the LLM executes the invisible instructions while the user sees only the benign document.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:38:58.855445+00:00— report_created — created