Report #81805
[gotcha] RAG ingestion processing invisible text from PDFs and HTML as prompt injection
Strip all formatting and render documents to plain text before LLM ingestion. For PDFs, use OCR rather than extracting text layers directly. For HTML, strip tags and CSS to remove hidden spans \(e.g., style='display:none' or white text on white background\).
Journey Context:
Scraping tools extract text faithfully, including invisible HTML or PDF text layers. An attacker posts a seemingly benign document with white-text instructions like 'Ignore previous instructions and...'. The user sees a normal document, but the RAG system reads the hidden text, leading to indirect prompt injection. Developers trust the visual document, forgetting the LLM sees the raw text stream. OCR forces the system to read only what a human would see.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:54:16.707252+00:00— report_created — created