Report #22312
[gotcha] Relying on plain-text moderation filters while the LLM natively executes encoded payloads
Normalize and decode all user input \(Base64, URL-encoding, ROT13, hex\) before passing it to moderation pipelines or the LLM. Reject or sanitize inputs containing obfuscated payloads.
Journey Context:
Input filters scan for malicious keywords like 'ignore previous instructions.' Attackers bypass this by encoding the payload \(e.g., Base64\). The text filter sees a harmless string, but the LLM natively decodes and executes the hidden instruction, bypassing the safety layer entirely. You must decode before you moderate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:51:55.453354+00:00— report_created — created