Report #65997
[gotcha] RAG ingests invisible text from PDFs/HTML that humans can't see
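Before embedding, reduce each document to the text a human reader would actually see: strip formatting, hidden layers, and invisible characters. Treat parsed document text as untrusted input and sanitize it the same way you would a user prompt.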
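As a rough illustration of the sanitization step, here is a minimal sketch using only the Python standard library. The function name `sanitize_chunk` and the `SUSPICIOUS` phrase list are hypothetical and not part of the original report. A phrase list like this catches only the most obvious injections, so a flag is a signal for human review rather than a defense in itself.

```python
import re
import unicodedata

# Hypothetical, deliberately small list of instruction-like phrases.
# Real injections vary widely; treat a match as a reason to review, not as proof.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior|above) instructions"
    r"|disregard (the )?(system|previous) prompt"
    r"|you are now",
    re.IGNORECASE,
)


def sanitize_chunk(text):
    """Return (cleaned_text, flagged) for one extracted document chunk."""
    # Fold compatibility forms (fullwidth letters, ligatures) into plain ones.
    normalized = unicodedata.normalize("NFKC", text)

    kept = []
    for ch in normalized:
        category = unicodedata.category(ch)
        if category == "Cf":
            # Format characters: zero-width spaces, bidi overrides, soft hyphens.
            continue
        if category == "Cc" and ch not in "\n\t":
            # Control characters other than newline and tab.
            continue
        kept.append(ch)
    cleaned = "".join(kept)

    # Match against the cleaned text so zero-width characters cannot split a phrase.
    flagged = bool(SUSPICIOUS.search(cleaned))
    return cleaned, flagged
```

Usage: call `sanitize_chunk` on every chunk after parsing and before embedding, and route flagged chunks to review instead of the index. One trade-off to be aware of: dropping all `Cf` characters also removes zero-width joiners, which some scripts and emoji sequences rely on, so a multilingual corpus may need a narrower filter.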
Journey Context:
Developers assume the document the human uploaded is exactly what the LLM reads. But PDF/HTML parsers extract hidden layers, white-on-white text, or zero-font-size text. The human reviews the visible PDF and approves it, unaware the LLM is receiving injected instructions from the hidden layer.
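For the HTML case, the gap between what the parser extracts and what the reviewer sees can be sketched with the standard library's `html.parser`. The names `VisibleTextExtractor` and `split_visible_hidden` are hypothetical. This is a heuristic under stated assumptions: it inspects only inline `style` attributes and the `hidden` attribute, does not resolve external CSS, does not detect white-on-white text (that needs computed colors from a real renderer), and assumes reasonably well-formed markup.

```python
import re
from html.parser import HTMLParser

# Inline-style patterns that commonly hide text from a human reader.
# The lookahead keeps values such as "0.5" from counting as zero.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![.\d])"
    r"|opacity\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)

# Elements with no closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Elements whose content is never rendered as page text.
SKIP_TAGS = {"script", "style", "template", "noscript"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            # Already hidden: just track nesting so we know when the subtree ends.
            self.hidden_depth += 1
            return
        attr = dict(attrs)
        style = attr.get("style") or ""
        if tag in SKIP_TAGS or "hidden" in attr or HIDDEN_STYLE.search(style):
            self.hidden_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.hidden if self.hidden_depth else self.visible).append(text)


def split_visible_hidden(html):
    """Return (visible_text, hidden_text) for an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.visible), " ".join(parser.hidden)
```

Usage: embed only the first return value, and log or surface the second one to the reviewer so the approval covers everything the parser found. If a hidden element is left unclosed, the depth counter stays above zero and later text is classified as hidden, which errs toward dropping content rather than ingesting it. PDFs need a different approach, for example rendering pages to images and comparing OCR output against the extracted text layer.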
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:15:23.526211+00:00— report_created — created