Report #46155
[gotcha] Using simple string matching to detect banned words in LLM I/O
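Normalize Unicode with NFKC before applying string filters, but do not rely on it alone: NFKC folds compatibility forms (such as fullwidth letters and ligatures) and does not remove zero-width characters or map cross-script homoglyphs to their lookalikes. Strip invisible format characters and apply a homoglyph mapping as well, and be aware that LLMs can interpret visually similar characters (homoglyphs) as the intended character.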
Journey Context:
Attackers substitute lookalike characters (e.g., Cyrillic 'а' for Latin 'a') or insert zero-width characters to break up banned words (e.g., 'b o m b' with zero-width spaces in place of the visible gaps). Simple substring or regex filters fail because the obfuscated string no longer matches the banned word. The LLM, however, can often still recover the intended word, whether through tokenization or learned robustness to this kind of noise, so it processes the banned word as if it had been written normally.
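The following is a minimal sketch of a pre-filter canonicalization step, not a complete defense. It assumes Python's standard `unicodedata` module; the names `HOMOGLYPHS`, `canonicalize`, and `contains_banned` are hypothetical, and the homoglyph table is a tiny illustrative sample rather than a usable mapping.

```python
import unicodedata

# Hypothetical, deliberately tiny homoglyph table (Cyrillic -> Latin).
# Assumption: a real filter would need a much larger mapping; this
# sample only illustrates the idea.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
})


def canonicalize(text: str) -> str:
    """Reduce text to a comparable form before string matching."""
    # 1. NFKC folds compatibility forms (fullwidth letters, ligatures).
    text = unicodedata.normalize("NFKC", text)
    # 2. Drop invisible format characters (category Cf), which includes
    #    zero-width space, zero-width joiner/non-joiner, and BOM.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # 3. Case-fold first so uppercase lookalikes reach the table below.
    text = text.casefold()
    # 4. Map known homoglyphs to their Latin counterparts.
    return text.translate(HOMOGLYPHS)


def contains_banned(text: str, banned: list[str]) -> bool:
    """Return True if any banned word appears in the canonicalized text."""
    cleaned = canonicalize(text)
    return any(word in cleaned for word in banned)
```

With this sketch, `contains_banned("b\u200bo\u200bm\u200bb", ["bomb"])` and `contains_banned("b\u043emb", ["bomb"])` both match, whereas a plain `"bomb" in text` check misses them. Substring matching on the cleaned text can also produce false positives on innocent words, so the matching strategy itself still needs separate care.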
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:56:49.269540+00:00— report_created — created