Report #82008
[gotcha] Invisible text in PDFs/HTML executing prompt injection
Strip formatting and render documents to plain text before chunking for RAG. Inspect the extracted text for suspiciously long runs of whitespace or invisible Unicode characters (zero-width and other formatting code points, for example), and discard hidden layers and metadata during ingestion.
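The inspection step can be sketched roughly as follows, using only the Python standard library. The function names and the 40-character whitespace threshold are assumptions made for illustration, not part of this report. The category check will also flag legitimate uses of format characters (soft hyphens, joiners in emoji and some scripts), so treat a hit as a reason to review a chunk rather than proof of an attack.

```python
import re
import unicodedata

# Assumed threshold for illustration; tune it against your own corpus.
MAX_WHITESPACE_RUN = 40


def invisible_chars(text):
    """Return (index, code point) pairs for characters that render as nothing.

    Flags Unicode categories Cf (format, e.g. zero-width space), Co (private
    use), Cn (unassigned), and Cc (control) other than newline, CR, and tab.
    """
    found = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in {"Cf", "Co", "Cn"} or (cat == "Cc" and ch not in "\n\r\t"):
            found.append((i, f"U+{ord(ch):04X}"))
    return found


def long_whitespace_runs(text, min_len=MAX_WHITESPACE_RUN):
    """Return (start index, length) for each run of min_len or more whitespace chars."""
    pattern = re.compile(r"\s{%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(text)]


def is_suspicious(text):
    """True if the text contains invisible characters or long whitespace runs."""
    return bool(invisible_chars(text) or long_whitespace_runs(text))
```

A chunk for which is_suspicious returns True could be quarantined for manual review or dropped before it reaches the index, depending on how much recall you are willing to lose.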
Journey Context:
When ingesting PDFs or HTML, developers often use libraries that extract text regardless of whether it is visible. An attacker can create a PDF with white text on a white background that says 'Ignore all previous instructions'. The user sees a normal document, but the RAG system ingests the invisible text, which is later retrieved and followed by the model as if it were an instruction. Converting to plain text mitigates this, at the cost of losing structural layout information that might be useful for complex parsing.
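For the HTML case, a minimal sketch of visibility-aware extraction is shown below, using only the standard library's html.parser. The class name, the helper visible_text, and the list of hiding patterns are assumptions for illustration. It only recognizes the hidden attribute and a few inline style declarations; text hidden through stylesheets, CSS classes, or matching foreground and background colors is not detected, and PDFs need a separate approach.

```python
import re
from html.parser import HTMLParser

# Heuristic inline-style patterns that hide content; not exhaustive.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![.\d])|opacity\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)
# Elements whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style", "template", "noscript"}
# Elements with no closing tag; they never hold text, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping anything nested inside a hidden element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.stack = []  # (tag, is_hidden) for each currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        hidden = (
            tag in SKIP_TAGS
            or "hidden" in attr
            or bool(HIDDEN_STYLE.search(attr.get("style") or ""))
        )
        self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep text only if no enclosing element is hidden.
        if not any(hidden for _, hidden in self.stack):
            self.parts.append(data)


def visible_text(html):
    """Return whitespace-normalized text from elements not marked as hidden."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The sketch fails closed: an unclosed hidden element keeps hiding the text that follows it until an enclosing element closes, so malformed markup loses content instead of leaking hidden text into the index.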
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:14:24.526201+00:00— report_created — created