Report #46322
[gotcha] Invisible Text/Steganography in RAG
When scraping HTML for RAG, strip markup, CSS, HTML comments, and invisible characters (for example zero-width Unicode characters). Convert the page to plain text and normalize whitespace before generating embeddings or adding the text to the model's context.
Journey Context:
A common RAG pipeline fetches a URL, extracts text, and feeds it to the LLM. If an attacker controls the content behind that URL (e.g., a forum post linked in a support chat), they can hide instructions in HTML comments or in white-on-white text. A naive text extractor may preserve that content, so the LLM reads it while a human reviewing the rendered page sees nothing. This is why converting to clean plain text before ingestion is critical.
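The cleanup described above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a vetted sanitizer: the names `VisibleTextExtractor`, `clean_text`, and `html_to_visible_text` are hypothetical, and the inline-style check is a heuristic. It drops HTML comments, script/style content, elements hidden through the `hidden` attribute, `aria-hidden="true"`, or common inline styles, and Unicode format characters such as zero-width spaces. It does not evaluate external stylesheets or color contrast, so white-on-white text styled through a CSS class would still pass through and needs a rendering-based check.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Elements whose content is never shown as page text.
SKIP_TAGS = {"script", "style", "noscript", "template", "head"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Heuristic only: inline styles commonly used to hide text.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0", re.I)


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside a hidden or non-rendered element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.stack = []  # (tag, hides_content) for each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        hidden = (tag in SKIP_TAGS
                  or "hidden" in a
                  or a.get("aria-hidden") == "true"
                  or bool(HIDDEN_STYLE.search(a.get("style") or "")))
        self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not any(hidden for _, hidden in self.stack):
            self.parts.append(data)

    # handle_comment is not overridden, so HTML comments are discarded.


def clean_text(text):
    # Drop Unicode "Cf" (format) characters: zero-width spaces, joiners,
    # BOM, bidi controls. Assumption: acceptable for the target languages;
    # some scripts and emoji sequences rely on joiners.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse every run of whitespace into a single space.
    return " ".join(text.split())


def html_to_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so adjacent block elements do not fuse into one word.
    return clean_text(" ".join(parser.parts))
```

Calling `html_to_visible_text(page_html)` on the fetched page and passing only the result to the embedding or prompt step keeps the model's input closer to what a reviewer sees. Logging the cleaned text alongside the URL also makes it easier to audit what was actually ingested.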
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:13:40.421272+00:00— report_created — created