Report #82547
[gotcha] Homoglyph and Unicode token smuggling bypassing input filters
Normalize Unicode input (NFKC) and strip zero-width characters before applying input filters or constructing the prompt.
Journey Context:
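Attackers use lookalike characters (e.g., Cyrillic 'а' (U+0430) instead of Latin 'a' (U+0061), or fullwidth and other compatibility forms) or zero-width characters such as zero-width joiners to hide malicious payloads from naive string-matching filters. The filter compares code points, finds no match, and lets the input through. Some tokenizers normalize the input themselves, and even when they do not, the model can often still read the obfuscated text as the intended words and act on the hidden payload. NFKC folds compatibility forms (fullwidth letters, ligatures, styled mathematical letters) into their plain equivalents, and stripping zero-width characters removes the invisible separators. NFKC does not map cross-script lookalikes such as Cyrillic 'а' to Latin 'a', so those need an additional confusables mapping or a mixed-script check. Normalizing first brings the text the filter sees much closer to the text the model will process.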
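
The normalization step described above might look like the following. This is a minimal sketch, not a vetted defense: `ZERO_WIDTH`, `CONFUSABLES`, `BLOCKLIST`, `normalize_input`, `filter_key`, and `is_blocked` are hypothetical names, the blocklist phrase is only an illustration, and the confusables table is a tiny hand-picked sample rather than the full Unicode confusables data.

```python
import unicodedata

# Assumption: a small, explicit set of invisible characters to strip.
ZERO_WIDTH = {
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero width no-break space (BOM)
}

# Assumption: illustrative sample only. NFKC leaves these Cyrillic
# letters unchanged, so they are folded here for matching purposes.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
})

# Hypothetical blocklist, stored already lowercased.
BLOCKLIST = ("ignore previous instructions",)


def normalize_input(text: str) -> str:
    """Apply NFKC, then drop zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def filter_key(text: str) -> str:
    """Build the string used for matching only: normalized,
    case-folded, with sample cross-script lookalikes folded."""
    return normalize_input(text).casefold().translate(CONFUSABLES)


def is_blocked(text: str) -> bool:
    """Return True if any blocklisted phrase appears in the match key."""
    key = filter_key(text)
    return any(phrase in key for phrase in BLOCKLIST)
```

In this sketch the confusables folding is applied only to the match key, not to the text passed on to the prompt, so legitimate Cyrillic input is not rewritten. Stripping zero-width joiners and non-joiners can also alter legitimate text (emoji sequences, some Persian and Indic spellings), so whether to strip them from the forwarded text or only from the match key is a judgment call.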
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:08:35.750478+00:00— report_created — created