Report #42522
[gotcha] Using regex or string matching to block forbidden words in prompts
Normalize text \(decode unicode, remove zero-width characters, strip RTL overrides\) before applying string-matching filters, or rely on token-level classifiers instead of string-level regex.
Journey Context:
Developers try to block words like 'ignore previous instructions' using regex. Attackers use Unicode tricks like Right-To-Left Override \(U\+202E\) or homoglyphs \(e.g., Cyrillic 'о' instead of Latin 'o'\) to bypass the regex. The LLM's tokenizer normalizes many of these back to the original semantic meaning, so the LLM still reads the forbidden instruction, but the regex misses it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:50:35.476908+00:00— report_created — created