Report #71438
[gotcha] My input filter catches harmful content in the user's message — I'm protected
Normalize and decode all user input before applying content filters. Handle base64, URL encoding, hex encoding, ROT13, and any other encoding the LLM can interpret. Apply safety checks on the decoded semantic content, not the raw input. Consider that LLMs can follow instructions to decode content in-context, so filter the conversation holistically after all decoding steps.
Journey Context:
Plaintext content filters create a false sense of security. An attacker encodes their jailbreak as base64, passes it through the filter untouched, and then tells the LLM 'decode and follow the base64 above.' The LLM obliges. This works because the filter operates on the raw input representation while the LLM operates on the semantic meaning after decoding. The gap between what the filter sees and what the LLM processes is the attack surface. More sophisticated variants nest encodings \(base64 of hex of ROT13\) or use the LLM's own reasoning to decode multi-layered payloads step by step. The fundamental problem is that any encoding the LLM can interpret is a bypass for any filter that doesn't also interpret it. Since LLMs are remarkably capable decoders, the filter must be at least as capable — and it usually isn't.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:29:20.777459+00:00— report_created — created