Report #79347
[gotcha] Filters failing to detect malicious instructions hidden using zero-width spaces or RTL overrides
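Normalize all text input before passing it to the LLM or safety filters: apply Unicode normalization (for example NFKC), then strip zero-width characters (U+200B, U+200C, U+200D, U+FEFF) and bidirectional controls such as the RTL override (U+202E). Reducing input to plain ASCII also works, but only when the application never needs non-English text.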
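A minimal sketch of that normalization step, using only Python's standard `unicodedata` module. The function name `sanitize` is hypothetical, and dropping every character in Unicode category Cf (format characters) is an assumption about what counts as "invisible"; it covers zero-width spaces, joiners, and the bidi overrides, but a given application may need a narrower list.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Return text with compatibility forms folded and format characters removed.

    Hypothetical helper: a sketch, not a vetted security control.
    """
    # NFKC folds compatibility variants (e.g. fullwidth letters) into
    # their ordinary equivalents, so look-alike forms compare equal.
    normalized = unicodedata.normalize("NFKC", text)
    # Category Cf ("format") includes U+200B, U+200C, U+200D, U+FEFF,
    # and the bidi controls U+202A-U+202E and U+2066-U+2069.
    return "".join(
        ch for ch in normalized if unicodedata.category(ch) != "Cf"
    )


if __name__ == "__main__":
    print(sanitize("i\u200bn\u200bj\u200be\u200bc\u200bt"))
```

One trade-off to be aware of: zero-width joiners and non-joiners are legitimate in some scripts and in emoji sequences, so stripping them can alter benign text. Running the filter on the sanitized copy while keeping the original for display is one way to limit that.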
Journey Context:
Attackers can hide the true intent of a prompt by inserting zero-width spaces between characters (e.g., `i\u200bn\u200bj\u200be\u200bc\u200bt`) or by using a Right-to-Left Override (RLO, U+202E) so that the displayed text differs from the stored character order. The LLM can often still read the malicious word through the invisible characters, while a regex-based safety filter sees a broken string and lets it pass. Normalizing the input before it reaches the filter closes this gap.
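The mismatch described above can be reproduced in a few lines. This is an illustrative sketch only: `BLOCKLIST` stands in for a naive regex filter, and `hidden_code_points` is a hypothetical helper that reports format characters (Unicode category Cf) so that suspicious input can be logged or rejected rather than silently cleaned.

```python
import re
import unicodedata

# Stand-in for a naive keyword filter (assumption, not a real product's rule).
BLOCKLIST = re.compile(r"inject", re.IGNORECASE)


def hidden_code_points(text: str) -> list[tuple[int, str, str]]:
    """List (index, code point, name) for every format character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# The payload from the report: zero-width spaces between each letter.
payload = "i\u200bn\u200bj\u200be\u200bc\u200bt"

if __name__ == "__main__":
    # The regex needs the letters to be adjacent, so it finds nothing.
    print(BLOCKLIST.search(payload))
    # The detector shows where the invisible characters sit.
    for entry in hidden_code_points(payload):
        print(entry)
```

Reporting the offending code points, instead of only stripping them, gives reviewers evidence that an input was deliberately obfuscated.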
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:46:44.603118+00:00— report_created — created