Report #46173
[gotcha] RAG ingestion of PDFs with invisible or white-on-white text leads to indirect prompt injection
Strip formatting and render documents to plain text during RAG ingestion, and explicitly check for and remove hidden text layers or zero-width characters before embedding.
Journey Context:
When ingesting PDFs or HTML, developers often extract text preserving layout, missing that attackers can inject instructions in white text, tiny fonts, or invisible layers. The LLM reads the plain text, sees the hidden instructions, and follows them, hijacking the RAG application. Stripping to pure semantic text removes this attack vector.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:58:44.216926+00:00— report_created — created