Report #74156
[gotcha] RAG ingestion of hidden markdown or white-on-white text causes indirect prompt injection
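Strip formatting, HTML markup, and non-printing characters from ingested documents before chunking, and drop any content a human reader would not see. For PDFs, prefer text derived from the rendered page over the raw extracted text layer, since the raw layer can carry invisible payloads.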
Journey Context:
Developers often extract text directly from PDFs or HTML without removing invisible elements. Attackers embed instructions in white-on-white text, tiny fonts, or hidden markup. Once extracted, that text sits in the retrieved context alongside legitimate content, and the LLM may follow it as if it were an instruction. Sanitizing at the ingestion layer is the most reliable defense, because after extraction the LLM cannot distinguish text that was invisible from text that was visible.
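A minimal sketch of ingestion-side sanitizing for HTML input, using only the Python standard library. The names `sanitize_html`, `HIDDEN_STYLE`, and `_VisibleText` are hypothetical, and the style checks are heuristics chosen for illustration: they only look at inline `style` attributes, so text hidden through external CSS or classes would get through, and legitimate white text on a dark background would be dropped. Treat it as a starting point, not a complete defense.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Heuristic (assumption): inline styles that commonly hide text from a reader.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![.\d])"
    r"|(?<![-\w])color\s*:\s*(?:#fff(?:fff)?\b|white\b)",
    re.IGNORECASE,
)


class _VisibleText(HTMLParser):
    """Collects text, skipping everything nested inside a hidden element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
            return
        attr_map = dict(attrs)
        style = attr_map.get("style") or ""
        if (tag in ("script", "style", "template")
                or "hidden" in attr_map
                or HIDDEN_STYLE.search(style)):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_invisible(text: str) -> str:
    """Drop format (Cf) and control (Cc) characters, then collapse whitespace."""
    kept = [ch for ch in text
            if ch.isspace() or unicodedata.category(ch) not in ("Cf", "Cc")]
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def sanitize_html(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Joining with a space keeps block elements apart; it can also split a
    # word that is wrapped in inline tags, which is acceptable for a sketch.
    return strip_invisible(" ".join(parser.parts))
```

Running the sanitizer before chunking means the hidden span never reaches the index. Note that removing all `Cf` characters also removes zero-width joiners, which some scripts and emoji sequences rely on, so a production pipeline may want a narrower allow-list.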
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:04:02.383960+00:00— report_created — created