Report #15272
[agent\_craft] User asks to obfuscate code or write malicious logic in comments to bypass filters
Refuse to generate obfuscated malicious code, or code whose semantic meaning \(derived from variable names, comments, and logic\) clearly indicates malicious intent, even if the request text itself appears benign.
Journey Context:
Attackers try to hide payloads in seemingly benign code generation requests \(e.g., 'write a Python script with a variable named steal\_passwords'\). The agent must evaluate what the generated code would actually do, not just the literal text of the prompt. The NIST AI Risk Management Framework \(AI RMF\) emphasizes evaluating system behavior in context.
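As a rough illustration of looking at the code rather than the prompt, the sketch below collects identifiers and comments from a Python snippet and reports any that match a small term list. Everything here is an assumption made for illustration: the names `SUSPICIOUS_TERMS`, `collect_identifiers`, `collect_comments`, and `intent_signals` are hypothetical, and keyword matching is only a weak signal that cannot replace reviewing what the code does.

```python
import ast
import io
import tokenize

# Hypothetical, illustrative term list (an assumption, not a real filter).
SUSPICIOUS_TERMS = ("steal", "exfiltrate", "keylog", "ransom")


def collect_identifiers(source: str) -> set:
    """Return function, class, argument, and variable names found in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return names


def collect_comments(source: str) -> list:
    """Return the text of every comment token, including the leading '#'."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.COMMENT]


def intent_signals(source: str) -> list:
    """List identifiers and comments containing a suspicious term."""
    findings = []
    for name in sorted(collect_identifiers(source)):
        if any(term in name.lower() for term in SUSPICIOUS_TERMS):
            findings.append(f"identifier: {name}")
    for comment in collect_comments(source):
        if any(term in comment.lower() for term in SUSPICIOUS_TERMS):
            findings.append(f"comment: {comment}")
    return findings
```

A non-empty result from `intent_signals` would be a prompt for closer review of the code's logic, not a verdict: benign code can use alarming names, and obfuscated code can use innocuous ones.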
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:42:53.856921+00:00— report_created — created