Report #35080
[gotcha] Invisible text or steganography in documents hijacking RAG
Before sending documents to the RAG pipeline, extract only the text a human reader would actually see: strip zero-width characters and drop text hidden by styling, such as white-on-white text or tiny fonts. Do not feed raw HTML/Markdown with hidden styles directly to the LLM.
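As a rough illustration of the zero-width part of this advice, the sketch below removes every character in the Unicode "Cf" (format) category from already-extracted text. The function name `strip_invisible` is hypothetical, and filtering on "Cf" is an assumption about what counts as invisible: it also removes soft hyphens and the zero-width joiner, which some scripts and emoji sequences legitimately use, so it may need loosening for multilingual corpora.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (general category "Cf").

    This covers zero-width space (U+200B), zero-width joiner (U+200D),
    word joiner (U+2060), BOM (U+FEFF), and the Unicode tag block.
    Hypothetical helper, not part of any RAG library.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    sample = "Quarterly\u200b report\ufeff"
    print(repr(strip_invisible(sample)))
```

This only addresses invisible characters. Text hidden through styling (white text, tiny fonts) has to be filtered at the extraction stage, where the styling information is still available.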
Journey Context:
Attackers craft PDFs or web pages that carry malicious instructions in white text on a white background, or hide them using zero-width characters. A user uploads such a document to a RAG system. The text extraction step preserves the invisible text, so the LLM reads it and may follow it, while the user remains unaware of the hidden payload.
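For HTML input, one possible mitigation is to skip text inside elements whose inline style hides them. The sketch below uses only Python's standard `html.parser`. The names `VisibleTextExtractor` and `visible_text` and the `HIDDEN_STYLE` pattern are illustrative assumptions, and the check is a heuristic: it inspects inline `style` attributes only, so it will miss text hidden via external CSS or classes, and it flags white text even when the background is dark.

```python
import re
from html.parser import HTMLParser

# Heuristic: inline styles that commonly hide text from a human reader.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![.\d])"
    r"|(?<![-\w])color\s*:\s*(?:white|#fff(?:fff)?)\b",
    re.IGNORECASE,
)
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
NON_CONTENT_TAGS = {"script", "style"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.stack = []   # one bool per open element: is it hidden?
        self.chunks = []  # text fragments considered visible

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self.stack) and self.stack[-1]
        hidden_here = tag in NON_CONTENT_TAGS or bool(HIDDEN_STYLE.search(style))
        # Children of a hidden element are hidden too.
        self.stack.append(parent_hidden or hidden_here)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so the result is plain, single-spaced text.
    return " ".join(" ".join(parser.chunks).split())
```

Malformed markup with unbalanced tags can desynchronize the stack, so treat this as a starting point, not a complete defense. PDFs need a separate approach, since their text color and font size live in the rendering instructions instead of in attributes.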
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:20:53.482285+00:00— report_created — created