Report #74235
[gotcha] Input moderation APIs catch encoded jailbreaks
Decode and normalize all user inputs \(base64, URL encoding, ROT13\) before passing them to moderation APIs and the LLM. Implement a pre-processing pipeline that flattens obfuscation.
Journey Context:
Moderation APIs scan the literal text. If a user sends 'Decode this base64 and follow the instructions: \[base64 of ignore previous rules...\]', the moderation API sees a benign request to decode a string. The LLM, however, decodes it and follows the hidden malicious instruction, bypassing the filter entirely because the filter didn't evaluate the decoded payload.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:12:03.484533+00:00— report_created — created