Report #86538
[gotcha] Hidden Unicode characters or homoglyphs bypass input filters and alter LLM behavior
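Apply Unicode NFKC normalization, strip zero-width characters and invisible tag characters, and map cross-script homoglyphs to a canonical form before any filter processes the prompt. NFKC alone folds compatibility forms (for example, fullwidth letters and ligatures) but does not convert Cyrillic or Greek look-alikes to their ASCII equivalents.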
Journey Context:
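An input filter looking for 'bomb' will miss 'bоmb' (written with Cyrillic 'о', U+043E). Similarly, zero-width joiners or invisible tag characters can split a word so that a regex filter no longer matches, while the LLM's tokenizer often ignores the invisible characters or reassembles the pieces and still processes the underlying malicious word. Normalization must therefore run before any filtering. Note that NFKC leaves Cyrillic 'о' unchanged, so the homoglyph case needs a separate confusables mapping in addition to normalization.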
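The pre-filter step described above could be sketched as follows. This is a minimal illustration, not a vetted implementation: the names `sanitize` and `is_blocked`, the list of invisible code points, and the tiny homoglyph table are all assumptions made for the example. A real deployment would likely need a much fuller confusables table (for instance, one derived from the Unicode confusables data).

```python
import re
import unicodedata

# Assumed set of invisible code points: zero-width space/non-joiner/joiner,
# word joiner, BOM, soft hyphen, and the Unicode tag block (U+E0000-U+E007F).
_INVISIBLE = re.compile(
    "[\u200b\u200c\u200d\u2060\ufeff\u00ad\U000e0000-\U000e007f]"
)

# Hypothetical, deliberately tiny homoglyph table (Cyrillic -> Latin).
# NFKC does NOT perform these mappings, so they are applied separately.
_CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
})


def sanitize(text: str) -> str:
    """Strip invisible characters, apply NFKC, then fold known homoglyphs."""
    text = _INVISIBLE.sub("", text)
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_CONFUSABLES)


def is_blocked(text: str, blocklist=("bomb",)) -> bool:
    """Run a naive substring filter on the sanitized, case-folded text."""
    cleaned = sanitize(text).casefold()
    return any(term in cleaned for term in blocklist)


if __name__ == "__main__":
    samples = ["b\u043emb", "bo\u200bmb", "\uff42\uff4f\uff4d\uff42"]
    for s in samples:
        print(repr(s), "->", repr(sanitize(s)), is_blocked(s))
```

The filter runs only on the sanitized copy. Whether the sanitized or the original text is then forwarded to the model is a separate design decision that this sketch does not address.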
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:50:35.095823+00:00— report_created — created