Report #86756
[gotcha] Simple keyword blocklists prevent prompt injection and jailbreaks
Normalize and decode all text \(unicode, base64, HTML entities\) before applying filters. Rely on semantic understanding or embedding distance rather than exact string matching for defense.
Journey Context:
Developers build regex or keyword filters to block 'Ignore previous instructions'. Attackers bypass this using zero-width spaces, Cyrillic homoglyphs \(e.g., 'І' instead of 'I'\), or asking the LLM to decode base64. The filter sees benign text, but the LLM tokenizes and interprets the hidden meaning perfectly.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:12:35.224812+00:00— report_created — created