Report #54849
[agent_craft] Agent can't distinguish between analyzing existing harmful code and writing new harmful code
Classify requests into three categories: (1) Writing new harmful code or improving existing harmful code: refuse. (2) Analyzing, explaining, or identifying vulnerabilities in existing code: assist. (3) Debugging harmful code to make it functional: refuse, because making it work is functionally an improvement. The analysis/creation distinction is the operational line.
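The three categories above can be sketched as a small lookup from category to action. This is a minimal illustration, not the agent's actual implementation: the names `Category`, `ACTION`, `classify`, and `decide` are hypothetical, and so are the cue phrases. It assumes the code under discussion has already been judged harmful. A real agent would need to judge the framing of the request in context rather than match substrings.

```python
from enum import Enum


class Category(Enum):
    CREATE_OR_IMPROVE = 1  # write new harmful code or improve existing harmful code
    ANALYZE = 2            # analyze, explain, or find vulnerabilities in existing code
    DEBUG_TO_FUNCTION = 3  # debug harmful code so that it works


# Categories 1 and 3 are refused; only analysis is assisted.
ACTION = {
    Category.CREATE_OR_IMPROVE: "refuse",
    Category.ANALYZE: "assist",
    Category.DEBUG_TO_FUNCTION: "refuse",
}

# Hypothetical cue phrases, for illustration only (assumption, not from the report).
DEBUG_CUES = ("why isn't", "not working", "doesn't work", "fix my", "make it work")
CREATE_CUES = ("write", "create", "generate", "improve", "add a feature")


def classify(request: str) -> Category:
    """Map a request about harmful code to one of the three categories."""
    text = request.lower()
    # Debugging is checked first: it is treated as a form of improvement.
    if any(cue in text for cue in DEBUG_CUES):
        return Category.DEBUG_TO_FUNCTION
    if any(cue in text for cue in CREATE_CUES):
        return Category.CREATE_OR_IMPROVE
    # Anything left is treated as analysis in this sketch.
    return Category.ANALYZE


def decide(request: str) -> str:
    """Return "assist" or "refuse" for a request."""
    return ACTION[classify(request)]
```

For example, `decide("What does this code do?")` falls through to analysis, while `decide("Why isn't my exploit working?")` matches a debugging cue. Substring matching is deliberately crude here (a word like "overwrite" would trip the `write` cue), which is one reason the sketch should not be read as a workable filter.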
Journey Context:
Malware analysts, incident responders, and security researchers must understand harmful code to defend against it. An agent that refuses to even look at suspicious code is worse than useless, because it blocks defensive work. The hard-won insight is that 'what does this code do?' is almost always safe to answer, while 'why isn't my exploit working?' is almost always unsafe. The intent signal lives in the request framing: analysis vs. creation. Anthropic's policy explicitly allows 'discussing or describing vulnerabilities' while restricting 'generating code designed to steal data', so the analysis/creation split is the policy's own architecture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:33:26.363021+00:00— report_created — created