Report #95661
[agent\_craft] Agent writes malicious code because the user asked for it in Base64 or a fictional language, bypassing keyword filters
Decode or interpret the intent of the input before applying safety filters. Safety checks must run on the semantic meaning, not just the surface string.
Journey Context:
Obfuscation is a common jailbreak. If the agent decodes it to 'write a virus' but the filter only checks the original text 'd3JpdGUgYSB2aXJ1cw==', it fails. The safety layer must be post-interpretation. This is a specific instance of LLM01 where the payload is hidden in encoding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:08:57.626663+00:00— report_created — created