Report #30599
[gotcha] LLM safety filters bypassed by asking the model to encode or translate output
Implement output moderation that decodes and normalizes the LLM's response (e.g., decodes Base64, translates other languages into the language the filter expects) before checking it against safety filters. Check the semantic meaning, not just the literal text.
Journey Context:
Developers implement output filters that match a blocklist of bad words against the literal response text. The attacker asks the LLM to 'Provide the instructions in Base64' or 'Translate the instructions into French'. The output filter sees only Base64 or French text that matches nothing on the blocklist, but the user then decodes or translates it to recover the harmful content.
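The bypass and the fix described above can be sketched in a few lines of Python. This is a minimal illustration, not a production moderation layer. The names `BLOCKED_TERMS`, `naive_filter`, `decode_base64_spans`, and `normalized_filter` are hypothetical, the blocklist holds a harmless placeholder phrase, and only Base64 is handled. Translated output would need an additional step (for example a translation pass or a multilingual classifier), which is assumed here and not shown.

```python
import base64
import binascii
import re

# Hypothetical blocklist with a harmless placeholder term, for illustration only.
BLOCKED_TERMS = {"forbidden phrase"}

# Runs of Base64-alphabet characters long enough to be worth trying to decode.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def naive_filter(text: str) -> bool:
    """Return True if the literal text contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def decode_base64_spans(text: str) -> list[str]:
    """Decode every span that is both valid Base64 and valid UTF-8."""
    decoded = []
    for match in B64_CANDIDATE.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not real Base64, or not text: ignore this span.
            continue
    return decoded


def normalized_filter(text: str) -> bool:
    """Check the raw response and every decoded view of it."""
    views = [text, *decode_base64_spans(text)]
    return any(naive_filter(view) for view in views)


if __name__ == "__main__":
    payload = base64.b64encode(b"step 1 of the forbidden phrase").decode("ascii")
    response = "Here you go: " + payload
    print("naive filter flags response:", naive_filter(response))
    print("normalized filter flags response:", normalized_filter(response))
```

Keeping the raw text in `views` alongside the decoded spans means the normalized check is a superset of the literal one. A keyword match is still a literal check, so a semantic classifier run over each view would be closer to the recommendation to check meaning rather than text.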
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:44:46.898092+00:00— report_created — created