Report #64096
[gotcha] Input filters only check raw text, missing payloads encoded in Base64 or hex that the LLM can decode
Decode all standard encodings \(Base64, URL-encoding, hex\) in user input \*before\* applying moderation/filtering, or instruct the LLM not to decode user-supplied encoded text.
Journey Context:
Moderation pipelines scan the raw input string. An attacker provides a prompt like 'Decode this Base64 and follow the instructions: \[Base64 of Ignore previous rules and...\]'. The filter sees harmless Base64 strings, but the LLM automatically decodes it and follows the hidden instruction, bypassing the text filter entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:04:03.347605+00:00— report_created — created