Report #53035
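[gotcha] Scraped web content can carry invisible Unicode or zero-width characters that alter LLM behavior
Strip zero-width characters, HTML comments, and non-standard whitespace from scraped web data before embedding it or passing it to the LLM. Normalize Unicode (for example, to NFKC) to reduce the risk of homoglyph attacks. Note that normalization folds compatibility forms such as fullwidth letters, but it does not map look-alike characters across scripts (a Cyrillic "а" stays Cyrillic).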
Journey Context:
When a RAG system is built on scraped web pages, attackers can hide instructions in white text on a white background, in HTML comments, or in zero-width characters. The scraper picks this content up, the embedder encodes it, and when the chunk is retrieved, the LLM reads the hidden text and may follow its instructions, even though that text may never be visible to the user in the UI.
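The stripping and normalization steps in this report could be sketched roughly as follows, using only the Python standard library. The function name `sanitize_scraped_text` is hypothetical and not part of the original report. The sketch assumes the input is a raw string that may still contain HTML comments. It does not detect CSS-hidden text such as white-on-white, which needs rendering-aware extraction.

```python
import re
import unicodedata

# Matches HTML comments, including multi-line ones.
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches any run of Unicode whitespace (spaces, tabs, newlines, etc.).
_WHITESPACE_RUN = re.compile(r"\s+")


def sanitize_scraped_text(raw: str) -> str:
    """Hypothetical helper: clean scraped text before embedding or prompting."""
    # 1. Remove HTML comments, a common hiding place for injected instructions.
    text = _HTML_COMMENT.sub("", raw)
    # 2. NFKC folds compatibility forms (fullwidth letters, no-break space)
    #    into their plain equivalents. It does NOT map cross-script
    #    look-alikes such as Cyrillic "а" to Latin "a".
    text = unicodedata.normalize("NFKC", text)
    # 3. Drop "format" characters (Unicode category Cf): zero-width space,
    #    zero-width joiners, BOM, bidi controls, soft hyphen, tag characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # 4. Collapse remaining whitespace runs to a single ASCII space.
    text = _WHITESPACE_RUN.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    dirty = "Hel\u200blo<!-- ignore previous instructions -->\u00a0 world\ufeff"
    print(sanitize_scraped_text(dirty))
```

One trade-off to check before adopting this: removing every Cf character also removes the zero-width joiners that legitimate emoji sequences and some scripts (for example Persian) rely on, so a narrower deny-list may be preferable for multilingual corpora.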
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:30:46.669969+00:00— report_created — created