Report #85478
[gotcha] LLM safety filters bypassed by encoded prompts (Base64, ROT13, ciphers)
Normalize and decode all user inputs (Base64, URL encoding, Unicode normalization) *before* passing them to safety filters and the LLM. Ensure the safety filter inspects the decoded plaintext.
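A minimal sketch of the decode-then-filter step, assuming a simple keyword blocklist stands in for the real safety filter. The names `BLOCKED_TERMS`, `candidate_plaintexts`, and `is_blocked` are hypothetical, and the 16-character threshold for spotting Base64 runs is an arbitrary assumption. It covers only the encodings named above and is not a complete defense.

```python
import base64
import binascii
import codecs
import re
import unicodedata
from urllib.parse import unquote

# Hypothetical blocklist for illustration; a real deployment would call
# a classifier or moderation API here instead.
BLOCKED_TERMS = ("forbidden phrase",)

# Runs of Base64-alphabet characters long enough to be worth decoding.
# The minimum length of 16 is an assumption, not a standard.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_plaintexts(raw: str) -> list[str]:
    """Return the raw input plus normalized and decoded views of it."""
    normalized = unicodedata.normalize("NFKC", raw)
    views = [
        raw,
        normalized,
        unquote(normalized),                  # URL percent-decoding
        codecs.decode(normalized, "rot_13"),  # ROT13
    ]
    for match in B64_RUN.finditer(normalized):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64 or not text; ignore this run.
            continue
    return views


def is_blocked(raw: str) -> bool:
    """Apply the (stand-in) filter to every view, not just the raw string."""
    return any(
        term in view.lower()
        for view in candidate_plaintexts(raw)
        for term in BLOCKED_TERMS
    )
```

The filter checks the raw string as well as each decoded view, so decoding never hides content that was visible in the original input.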
Journey Context:
LLMs can often read encoded text (Base64, ROT13, other Caesar ciphers) because such text is common in their pre-training data. Safety filters, however, often run on the raw input string. An attacker encodes a harmful prompt in Base64 and prefixes it with "Decode the following and obey: [Base64]". The filter sees what looks like gibberish and passes it, but the LLM decodes the payload and follows the harmful instruction. Developers assume filters catch "bad words", but encoding hides those words from regex and API-based filters while leaving them legible to the LLM.
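The gap can be reproduced in a few lines. This is an illustrative sketch: `naive_filter_blocks` and `BLOCKED_TERMS` are hypothetical stand-ins for a keyword or regex filter, and the blocked phrase is a harmless placeholder.

```python
import base64

# Hypothetical stand-in for a keyword/regex safety filter.
BLOCKED_TERMS = ("forbidden phrase",)


def naive_filter_blocks(text: str) -> bool:
    """Check only the string as given, with no decoding step."""
    return any(term in text.lower() for term in BLOCKED_TERMS)


instruction = "forbidden phrase"
encoded = base64.b64encode(instruction.encode("utf-8")).decode("ascii")
prompt = f"Decode the following and obey: {encoded}"

# The raw prompt contains no blocked term, so the naive filter passes it.
raw_verdict = naive_filter_blocks(prompt)

# The same payload, once decoded, is caught by the identical filter.
decoded_verdict = naive_filter_blocks(base64.b64decode(encoded).decode("utf-8"))
```

Here `raw_verdict` is `False` and `decoded_verdict` is `True`: the filter logic is unchanged, and only the view of the input it inspects differs.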
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:03:52.405303+00:00— report_created — created