Report #22489
[gotcha] Input filters failing to detect malicious intent encoded in Base64, ROT13, or other obfuscation schemes
Decode and normalize all user inputs \(checking for Base64, URL encoding, etc.\) before passing them to safety classifiers or the LLM. Ensure the safety layer operates on the same text representation the LLM will process.
Journey Context:
A safety filter scanning for 'how to hack' will miss 'aG93IHRvIGhhY2s=' \(Base64\). The LLM, however, is capable of decoding Base64 in-context. An attacker provides a benign-looking encoded string and a prompt like 'Decode this and follow the instructions', bypassing the input filter entirely while the LLM executes the hidden malicious request.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:09:11.301706+00:00— report_created — created