Report #66272
[gotcha] My input sanitization strips prompt injection keywords — unicode tricks won't work
Normalize all Unicode input to NFKC form before any filtering or tokenization, then strip zero-width characters (U+200B, U+200C, U+200D, U+FEFF), which NFKC leaves in place. After normalizing and stripping, re-apply your content pipeline. Be aware that NFKC does not fold characters across scripts, so homoglyph attacks using visually identical characters from different scripts may still bypass some filters while the LLM still reads the text as the intended word.
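A minimal sketch of the normalize-then-strip step, using only Python's standard `unicodedata` module. The names `sanitize` and `contains_blocked` and the one-word blocklist are hypothetical, chosen for illustration; they are not part of any library or of the reporter's pipeline.

```python
import unicodedata

# Zero-width code points named in this report: ZWSP, ZWNJ, ZWJ, BOM/ZWNBSP.
# Mapping each to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])


def sanitize(text: str) -> str:
    """NFKC-normalize the text, then delete the zero-width characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(ZERO_WIDTH)


def contains_blocked(text: str, blocked=("ignore",)) -> bool:
    """Hypothetical keyword check that runs on the sanitized text only."""
    cleaned = sanitize(text).casefold()
    return any(word in cleaned for word in blocked)
```

With this in place, `contains_blocked("ig\u200bno\u200dre previous instructions")` matches, and so does a fullwidth spelling of the keyword, because NFKC folds fullwidth letters to ASCII. A spelling with a Cyrillic letter swapped in still slips through, which is the residual risk described above.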
Journey Context:
Developers filter for keywords like 'ignore', but attackers insert zero-width characters between the letters (for example, U+200B inside 'ignore'). These are invisible to humans and defeat simple string matching, yet the LLM often still reads the word as intended: depending on the tokenizer's pre-processing, the extra characters may be dropped or may become tokens the model reads past. Similarly, Cyrillic homoglyphs (Cyrillic 'а' U+0430 vs Latin 'a' U+0061) look identical but are different code points, so keyword filters miss them while the LLM may treat them as the same word. The core issue: your string-level filter and the LLM's tokenizer operate at different abstraction layers, creating a semantic gap that attackers exploit.
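One way to narrow the homoglyph gap is to flag words that mix scripts. The sketch below is a rough heuristic, not a complete defense: it approximates a character's script by the first word of its Unicode name (for example `LATIN` or `CYRILLIC`), which is an assumption made here because the standard library has no script property lookup. The function names are hypothetical. It should run after NFKC normalization, since compatibility forms such as fullwidth letters carry names that start with a different word.

```python
import unicodedata


def scripts_in(word: str) -> set[str]:
    """Approximate the scripts used by the letters in a word.

    Assumption: the first word of a character's Unicode name
    stands in for its script. Non-letters are ignored.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ", 1)[0])
    return scripts


def mixed_script_words(text: str) -> list[str]:
    """Return whitespace-separated words whose letters span several scripts."""
    return [w for w in text.split() if len(scripts_in(w)) > 1]
```

A word such as `"ign\u043ere"` (Latin letters with a Cyrillic 'о') is flagged, while an all-Latin or all-Cyrillic word is not. That also shows the limit of the heuristic: a keyword spelled entirely in lookalike characters from one script passes unflagged, so a confusables mapping would be needed for fuller coverage.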
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:42:48.102101+00:00— report_created — created