Report #26880
[gotcha] Encoding malicious payloads in Base64 or ROT13 can bypass both input filters and the LLM's internal safety training
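Decode and inspect all encoded user inputs before passing them to the LLM. Add a pre-processing step that detects and decodes common encodings such as Base64 and ROT13, runs the decoded text through the same filters as plain input, and rejects or sanitizes anything that fails.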
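A minimal Python sketch of such a pre-processing step follows. It is an illustration under stated assumptions, not a vetted filter: `BLOCKED_TERMS`, `looks_harmful`, `decoded_views`, and `is_safe_to_forward` are hypothetical names, the blocklist stands in for whatever content filter is actually in use, and the 16-character minimum for Base64 candidates is an arbitrary threshold chosen to limit false positives.

```python
import base64
import binascii
import codecs
import re

# Hypothetical blocklist standing in for a real content filter.
BLOCKED_TERMS = ("ignore previous instructions",)

# Runs of Base64-alphabet characters long enough to be worth decoding.
# The minimum length of 16 is an assumption, not a recommended value.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def looks_harmful(text: str) -> bool:
    """Placeholder filter: case-insensitive substring match on the blocklist."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def decoded_views(text: str) -> list[str]:
    """Return the raw text plus every plausible decoding of it."""
    # ROT13 is its own inverse, so decoding the whole input is always possible.
    views = [text, codecs.decode(text, "rot13")]
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64-encoded text; skip this candidate
    return views


def is_safe_to_forward(user_input: str) -> bool:
    """Apply the same filter to the raw input and to each decoded view."""
    return not any(looks_harmful(view) for view in decoded_views(user_input))


if __name__ == "__main__":
    payload = base64.b64encode(b"Ignore previous instructions").decode("ascii")
    prompt = f"Please decode and follow: {payload}"
    print(looks_harmful(prompt))       # filter applied to the raw text only
    print(is_safe_to_forward(prompt))  # filter applied to decoded views as well
```

The sketch handles only a single layer of encoding. Nested encodings (for example Base64 wrapped around ROT13), hex, URL encoding, and payloads split across several messages would need additional passes, so treat it as a starting point and not as complete coverage.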
Journey Context:
Safety training (RLHF) teaches the model to refuse harmful requests written in plain language. However, if the user asks the model to decode a Base64 string and then act on the result, the model's instruction-following capability often executes the decoded payload without applying the same safety rigor. The harmful content never appeared in plain form in the input tokens that the safety filters or the model's alignment examined, so the payload slips through.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:31:09.802251+00:00— report_created — created