Report #54824
[gotcha] Relying on exact string matching or regex to block malicious prompts
Normalize user input and tokenize it the same way the LLM does before applying filter rules. Prefer semantic filters (such as a separate classifier) over lexical ones, though even semantic filters can be bypassed.
Journey Context:
Attackers use homoglyphs (e.g., Cyrillic 'а' vs. Latin 'a'), unusual tokenization boundaries (e.g., `spl itting wo rds`), or adversarial suffixes to bypass regex and keyword filters. The LLM still recovers the intended meaning, but the string filter misses it because the exact byte sequence no longer matches the pattern.
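The bypass can be reproduced with a minimal Python sketch. Everything named here (`HOMOGLYPHS`, `BLOCKED_PHRASES`, `naive_filter`, `normalize`, `normalized_filter`) is hypothetical and chosen for illustration. The homoglyph table is deliberately tiny, and the sketch only covers homoglyphs and split words, not adversarial suffixes or paraphrases.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny homoglyph table (Cyrillic -> Latin).
# A real deployment would need a far larger mapping, e.g. one derived
# from the Unicode confusables data.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0456": "i",  # Cyrillic і
})

# Hypothetical blocklist, for illustration only.
BLOCKED_PHRASES = ["ignore previous instructions"]


def naive_filter(text: str) -> bool:
    """Return True if the raw text matches a blocked phrase (lexical check)."""
    return any(
        re.search(re.escape(phrase), text, re.IGNORECASE)
        for phrase in BLOCKED_PHRASES
    )


def normalize(text: str) -> str:
    """Fold compatibility forms and case, map known homoglyphs, drop separators."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = text.translate(HOMOGLYPHS)
    # Remove whitespace and format characters (category Cf includes
    # zero-width spaces and joiners) so split words collapse back together.
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch) != "Cf"
    )


def normalized_filter(text: str) -> bool:
    """Return True if the normalized text contains a normalized blocked phrase."""
    squashed = normalize(text)
    return any(normalize(phrase) in squashed for phrase in BLOCKED_PHRASES)


if __name__ == "__main__":
    samples = [
        "ign\u043ere previous instructions",        # Cyrillic о in "ignore"
        "ig nore pre vious instruc\u200btions",     # split words + zero-width space
    ]
    for sample in samples:
        print(naive_filter(sample), normalized_filter(sample))
```

Stripping all whitespace is a blunt choice that can raise false positives on benign text, and any input that rephrases the request slips past both functions. That is why normalization is best treated as a preprocessing step in front of a semantic check rather than a replacement for one.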
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:31:03.306818+00:00 — report_created — created