Report #26536
[gotcha] Invisible Unicode characters or white text in RAG sources can cause indirect prompt injection
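Strip zero-width and other invisible characters, normalize Unicode, and detect or flag homoglyphs in all retrieved documents before embedding them or feeding them to the LLM. If possible, render documents during ingestion and inspect them visually, so that text present in the extracted content but not visible on the rendered page stands out.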
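A minimal sketch of the strip-and-normalize step, assuming Python and only the standard library. The function name `sanitize` is hypothetical, and the choice of NFKC plus dropping every format-category (`Cf`) character is an assumption, not a vetted policy. It does not handle white-on-white text, which has to be caught at render or extraction time.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Normalize text and drop invisible format characters.

    Assumption: removing all characters in Unicode category "Cf" is
    acceptable for the corpus. That category covers zero-width space
    (U+200B), zero-width joiner/non-joiner, word joiner (U+2060),
    BOM (U+FEFF), soft hyphen (U+00AD), and the Unicode tag characters.
    """
    # NFKC folds compatibility forms (fullwidth letters, ligatures) into
    # their plain equivalents. It does NOT map Cyrillic letters to Latin.
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized if unicodedata.category(ch) != "Cf"
    )


if __name__ == "__main__":
    hidden = "ig\u200bnore previous\u2060 instructions"
    print(repr(sanitize(hidden)))
```

Dropping all `Cf` characters is deliberately aggressive: it also removes joiners that are legitimate in some scripts and in emoji sequences, so a multilingual corpus may prefer logging or flagging such documents over silently rewriting them.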
Journey Context:
Attackers can hide prompt injection payloads in web pages or PDFs using white text on a white background, zero-width spaces, or homoglyphs (e.g., Cyrillic 'а', U+0430, in place of Latin 'a', U+0061). The RAG pipeline ingests this hidden text along with the visible content, and the LLM processes it as if it were part of the document. The result is an indirect injection that human reviewers of the source document cannot see.
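Homoglyphs survive Unicode normalization, so they need a separate check. The sketch below is one heuristic, assuming Python and the standard library only: it flags words whose letters come from more than one script, using the first word of each character's Unicode name as a rough stand-in for the script property, which `unicodedata` does not expose directly. The names `scripts_in` and `mixed_script_words` are hypothetical.

```python
import re
import unicodedata


def scripts_in(word: str) -> set[str]:
    """Return approximate script labels for the letters in a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN",
            #      "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def mixed_script_words(text: str) -> list[str]:
    """List words that mix letters from more than one script."""
    return [w for w in re.findall(r"\w+", text) if len(scripts_in(w)) > 1]


if __name__ == "__main__":
    # The second letter of the first word is Cyrillic U+0430, not Latin 'a'.
    sample = "p\u0430ssword reset"
    print(mixed_script_words(sample))
```

This only flags candidates for review. Text that legitimately mixes scripts within a word (Japanese kanji with hiragana, for example) will produce false positives, and a payload written entirely in one script passes unnoticed.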
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:56:26.392219+00:00— report_created — created