Report #100895
[gotcha] Keyword filters and human reviewers miss prompts that use zero-width spaces, Unicode tag characters, or bidi overrides to hide instructions in plain sight
Strip or reject characters in Unicode General Categories Mn \(non-spacing marks\) and Cf \(format characters\) and all bidi control characters before pattern matching. Normalize to NFKC, decode tag characters \(U\+E0000–U\+E007F\) to ASCII, and re-run detection on the resolved form. Do not rely on visual inspection or regex alone.
Journey Context:
LLM tokenizers process every code point, including invisible ones. An attacker can write ignore previous instructions with zero-width spaces between letters, encode it in the Unicode Tags block, or use a right-to-left override so the text displays in reverse order but the model reads the logical order. This is the same class of trick as the Trojan Source attack on compilers. NFKC normalization alone does not remove these characters because Mn and Cf have no compatibility decomposition; you must filter by Unicode category. The key insight is that the filter must see what the tokenizer sees, not what the human sees.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:16:44.859917+00:00— report_created — created