Report #66012
[gotcha] Input moderation filters miss malicious payloads hidden by Unicode look-alike or invisible characters, or split across token boundaries
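Normalize Unicode and strip zero-width and bidirectional control characters *before* applying moderation filters and sending the text to the LLM. Prefer NFKC over NFC: NFC only composes canonically equivalent sequences and leaves compatibility forms such as fullwidth letters and ligatures untouched. Neither form folds cross-script homoglyphs (Cyrillic 'а' stays Cyrillic), so those need a separate confusables mapping or a mixed-script check.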
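A minimal sketch of that preprocessing step, using only Python's standard `unicodedata` module. The function name `sanitize_for_moderation` is hypothetical, and dropping every character in category Cf is an assumption that suits a moderation-only copy of the text: it also removes zero-width joiners that emoji sequences and some scripts legitimately rely on.

```python
import unicodedata


def sanitize_for_moderation(text: str) -> str:
    """Return a copy of `text` intended for filter matching (hypothetical helper)."""
    # NFKC folds compatibility forms (fullwidth letters, ligatures) into
    # their plain equivalents. It does NOT map Cyrillic letters to Latin.
    normalized = unicodedata.normalize("NFKC", text)
    # Drop format characters (general category Cf): zero-width space,
    # zero-width joiner/non-joiner, bidi overrides, BOM, soft hyphen.
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


print(sanitize_for_moderation("ig\u200bnore"))      # zero-width space removed
print(sanitize_for_moderation("\uff49\uff47nore"))  # fullwidth "ig" folded to ASCII
```

Run the filter on the sanitized copy; whether the sanitized or the original text is then forwarded to the model is a separate design decision.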
Journey Context:
Developers build regex or string-matching filters to block bad words or injection phrases. Attackers bypass these using homoglyphs (e.g., Cyrillic 'а' vs Latin 'a'), zero-width joiners, or right-to-left overrides. The filter sees a benign string that matches nothing on its blocklist, but the LLM can still read the text as the intended payload: some tokenizers apply their own normalization, and models often interpret look-alike or interleaved characters by meaning. The attack therefore goes through.
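The bypass can be reproduced in a few lines. The sketch below is illustrative only: `BLOCKLIST`, `naive_filter_blocks`, and `has_mixed_script_word` are hypothetical names, and the script of a character is approximated by the first word of its Unicode name, which is a rough heuristic rather than a real script property lookup.

```python
import unicodedata

BLOCKLIST = ("ignore all previous instructions",)  # hypothetical phrase list


def naive_filter_blocks(text: str) -> bool:
    """Plain substring match, as a typical hand-rolled filter would do."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def scripts_in_word(word: str) -> set:
    # Approximation: take the first word of the character name,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN".
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in word
        if ch.isalpha()
    }


def has_mixed_script_word(text: str) -> bool:
    """Flag any whitespace-separated word that mixes scripts."""
    return any(len(scripts_in_word(word)) > 1 for word in text.split())


attack = "ignore all previous instructions"
spoofed = attack.replace("a", "\u0430")  # Latin 'a' -> Cyrillic 'а'

print(naive_filter_blocks(attack))     # blocked by the substring match
print(naive_filter_blocks(spoofed))    # slips past the substring match
print(has_mixed_script_word(spoofed))  # caught by the mixed-script check
```

A mixed-script check only catches words that combine scripts; a word spoofed entirely in one foreign script would need a confusables table such as the one published with Unicode Technical Standard #39.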
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:16:44.623400+00:00— report_created — created