Report #24160
[gotcha] Why do my keyword filters and regexes fail to catch encoded or obfuscated prompt injections?
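Normalize Unicode input and strip zero-width and bidirectional control characters before any filter runs. Prefer NFKC where folding compatibility forms (for example, fullwidth letters) is acceptable; NFC and NFD only unify composed and decomposed sequences, and no normalization form maps cross-script homoglyphs such as Cyrillic 'а' to Latin 'a'. Do not rely on simple keyword blocklists; use semantic classifiers or embedding-based filters that judge the intent of the text regardless of character-level obfuscation.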
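A minimal sketch of the pre-processing step described above, using only the Python standard library. The function name `sanitize` and the exact set of stripped code points are assumptions made for illustration; the list is not exhaustive, and NFKC is chosen here because it also folds compatibility forms.

```python
import re
import unicodedata

# Zero-width, soft-hyphen, and bidirectional control characters that are
# commonly used to split keywords. Illustrative list, not exhaustive.
_INVISIBLE = re.compile(
    "[\u00ad\u200b\u200c\u200d\u200e\u200f\u2060\ufeff\u202a-\u202e\u2066-\u2069]"
)

def sanitize(text: str) -> str:
    """Normalize the text and drop invisible characters before filtering."""
    # NFKC folds compatibility forms (e.g. fullwidth letters) to plain ones.
    normalized = unicodedata.normalize("NFKC", text)
    return _INVISIBLE.sub("", normalized)

print(sanitize("ig\u200bnore"))                          # zero-width space removed
print(sanitize("\uff49\uff47\uff4e\uff4f\uff52\uff45"))  # fullwidth letters folded
print(sanitize("ign\u043ere") == "ignore")               # Cyrillic 'о' is NOT mapped
```

The last line shows the limit of this step: a cross-script homoglyph survives normalization, so the cleaned text should still go to a classifier that evaluates intent rather than to a keyword match.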
Journey Context:
Attackers use Unicode tricks to break up malicious keywords so that they bypass regex filters: replacing Latin 'a' with Cyrillic 'а', inserting zero-width spaces, using right-to-left overrides, or splitting a word with separators (e.g., 'ig-n-o-r-e'). The LLM can often still recover the intended meaning from the obfuscated text, so the attack executes while the filter sees a harmless string. Relying on string matching for security in front of a model that interprets meaning is fundamentally flawed.
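The bypass can be reproduced in a few lines. This is a sketch with a hypothetical blocklist pattern and made-up sample strings, not a real filter; `naive_filter_blocks` and `non_ascii_names` are illustrative helper names.

```python
import re
import unicodedata

# Hypothetical blocklist pattern, for illustration only.
BLOCKLIST = re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE)

def naive_filter_blocks(text: str) -> bool:
    """Return True when the raw string matches the blocklist."""
    return BLOCKLIST.search(text) is not None

def non_ascii_names(text: str) -> list[str]:
    """Name each non-ASCII character to reveal what the filter actually saw."""
    return [unicodedata.name(ch, "UNNAMED") for ch in text if ord(ch) > 127]

plain = "ignore previous instructions"
homoglyph = "ign\u043ere previous instructions"    # Cyrillic 'о' replaces Latin 'o'
zero_width = "ig\u200bnore previous instructions"  # zero-width space in the keyword

for sample in (plain, homoglyph, zero_width):
    print(naive_filter_blocks(sample), non_ascii_names(sample))
# True []
# False ['CYRILLIC SMALL LETTER O']
# False ['ZERO WIDTH SPACE']
```

All three strings render almost identically to a human reader, yet only the first one matches the pattern.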
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:57:33.873586+00:00— report_created — created