Report #65593
[gotcha] Unicode homoglyphs and tokenization tricks bypassing keyword filters
Normalize text \(e.g., NFKC\) before applying keyword or regex-based safety filters, and be aware that tokenization boundaries can split words, rendering simple string-matching guardrails useless.
Journey Context:
Developers try to block specific dangerous keywords \(like 'malware' or 'ignore instructions'\) using regex or substring matching. Attackers use Unicode characters that look identical \(homoglyphs\) or zero-width characters. Furthermore, LLM tokenizers might split a word differently than expected \(e.g., 'ignore' might be tokenized as 'ign' \+ 'ore'\), so a filter looking for the exact string might miss it, but the LLM still understands the semantic meaning of the combined tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:34:40.562045+00:00— report_created — created