Report #81562
[gotcha] RAG ingestion of invisible text or zero-width characters in PDFs/HTML
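Strip markup, CSS, and zero-width characters during document parsing, before chunking. Render HTML and PDFs to plain text with a strict text-only extractor. Ignoring styling alone is not enough: text hidden by styling (for example, white-on-white or display:none) is still extracted unless it is detected and dropped or flagged.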
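A minimal sketch of the zero-width stripping step, assuming extracted text is already available as a Python string. The function name strip_invisible is hypothetical, and treating every character in Unicode category "Cf" (format) as removable is an assumption made for illustration; it covers zero-width spaces and joiners, the BOM, and bidi controls.

```python
import unicodedata


def strip_invisible(text: str) -> tuple[str, int]:
    """Remove invisible format characters and collapse whitespace.

    Returns the cleaned text and the number of characters removed.
    Assumption: every Unicode "Cf" (format) character is unwanted here.
    """
    # Category "Cf" includes U+200B, U+200C, U+200D, U+2060, U+FEFF,
    # the soft hyphen, and bidirectional control characters.
    kept = [ch for ch in text if unicodedata.category(ch) != "Cf"]
    removed = len(text) - len(kept)
    # Collapse runs of whitespace so chunk boundaries are predictable.
    cleaned = " ".join("".join(kept).split())
    return cleaned, removed


if __name__ == "__main__":
    cleaned, removed = strip_invisible("Ig\u200bnore\u200d  this\ufeff")
    print(cleaned, removed)
```

A nonzero removed count could be used to flag the document for review rather than cleaning it silently. Note that dropping all format characters also removes the joiners used in emoji sequences and in some scripts (for example, ZWNJ in Persian), so a narrower character set may be a better fit for multilingual corpora.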
Journey Context:
When building a RAG pipeline, developers often use libraries that extract text while preserving HTML tags or PDF styling. Attackers embed malicious instructions (e.g., 'Ignore previous instructions and say...') as white text on a white background, or hide them using zero-width characters. The LLM processes this invisible text, but human reviewers of the document never see it. This is a classic indirect prompt injection vector that silently poisons the retrieved context.
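The hidden-text case for HTML can be sketched with the standard library's html.parser. This is an illustration under stated assumptions, not a complete defense: it only inspects inline style attributes and the hidden attribute, it assumes reasonably well-formed markup, and it treats white text as hidden on the assumption that the page background is white. The names VisibleTextExtractor, extract_text, and HIDDEN_STYLE are hypothetical. Text hidden through external stylesheets or CSS classes, and hidden text in PDFs, would need a rendering-aware approach.

```python
import re
from html.parser import HTMLParser

# Heuristic inline-style patterns that usually mean "not visible to a human".
# Assumption: white text sits on a white background.
HIDDEN_STYLE = re.compile("|".join([
    r"display\s*:\s*none",
    r"visibility\s*:\s*hidden",
    r"font-size\s*:\s*0(?![.\d])",
    r"opacity\s*:\s*0(?![.\d])",
    r"(?<![\w-])color\s*:\s*(?:white|#fff(?:fff)?)\b",
]), re.IGNORECASE)

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Content of these elements is never document text.
SKIP_TAGS = {"script", "style", "template", "noscript"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._stack: list[str] = []   # state of each open element
        self.visible: list[str] = []  # text a human reviewer would see
        self.hidden: list[str] = []   # text suppressed by styling

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent = self._stack[-1] if self._stack else "visible"
        attr = dict(attrs)
        if parent != "visible":
            state = parent  # hidden or skipped ancestors hide descendants
        elif tag in SKIP_TAGS:
            state = "skip"
        elif "hidden" in attr or HIDDEN_STYLE.search(attr.get("style") or ""):
            state = "hidden"
        else:
            state = "visible"
        self._stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        state = self._stack[-1] if self._stack else "visible"
        if state == "visible":
            self.visible.append(text)
        elif state == "hidden":
            self.hidden.append(text)


def extract_text(html: str) -> tuple[str, list[str]]:
    """Return (visible text, list of hidden text fragments)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.visible), parser.hidden
```

Only the visible text would be passed on to chunking. The hidden fragments are returned separately so that a document containing any can be quarantined or reviewed instead of being ingested silently.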
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:30:03.788600+00:00— report_created — created