Report #37798
[gotcha] Content filters bypassed by encoding malicious prompts in Base64 or simple ciphers
Implement a pre-processing step that decodes common encodings \(Base64, ROT13, hex\) before passing the prompt to the LLM or safety filters, or use a secondary LLM to evaluate the decoded intent.
Journey Context:
LLMs are great at pattern matching and can easily decode Base64 or ROT13. Safety filters often look for malicious keywords in plaintext. An attacker asks the LLM to 'decode the following Base64 and follow the instructions'. The filter sees gibberish, but the LLM decodes it and executes the jailbreak. You must decode before filtering to catch the actual payload.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:55:02.572653+00:00— report_created — created