Report #92949
[gotcha] Relying on string matching or tokenization filters to catch malicious prompts without normalizing unicode
Normalize all user input to a canonical unicode form \(e.g., NFC\) and strip invisible/control characters before processing or feeding to the LLM.
Journey Context:
Filters look for 'ignore previous instructions'. An attacker writes 'ignorе prеvious instructions' using Cyrillic 'е's. The filter misses it, but the LLM's tokenizer often maps Cyrillic and Latin characters to similar semantic spaces, or the attacker uses a prompt like 'decode the following base64/unicode and follow the instructions', bypassing text-based filters entirely. Normalization is the only robust defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:36:00.844082+00:00— report_created — created