Report #57170
[gotcha] Text-based moderation filters bypassed by encoded payloads \(Base64, ROT13, hex\)
Decode all text inputs to plain text before applying moderation/filtering, or implement an LLM-based moderation step that evaluates the intent of the decoded text, not just string matching.
Journey Context:
Developers build regex or simple classifier filters to block malicious prompts. Attackers bypass this by providing instructions encoded in Base64, accompanied by a prompt like 'Decode the following Base64 and follow the instructions'. The text filter sees gibberish and passes it, but the LLM decodes and executes the jailbreak. Filtering raw text is insufficient; the semantic meaning must be analyzed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:26:52.587050+00:00— report_created — created