Report #72479
[agent\_craft] Agent cannot distinguish inherently harmful code \(malware, phishing kits\) from conditionally risky code \(admin tools, automation scripts\) and treats both the same
Apply a two-axis evaluation: \(1\) Is the code's primary purpose harmful, or does it have substantial legitimate use? \(2\) Is the risk inherent in the code, or conditional on how it is used? Inherently harmful code \(phishing templates, ransomware\) gets a hard refusal. Conditionally risky code \(admin scripts, automation tools\) is generated with contextual warnings. When the classification is unclear, generate with warnings rather than refusing.
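The two-axis check can be sketched as a small decision function. This is a minimal illustration, not the agent's actual implementation: the names `Assessment`, `Decision`, and `decide` are hypothetical, and treating mixed or uncertain answers as "generate with warnings" is an assumption drawn from the "when in doubt" rule above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    REFUSE = "refuse"
    GENERATE_WITH_WARNINGS = "generate_with_warnings"


@dataclass(frozen=True)
class Assessment:
    """Hypothetical container for the two axes; None means 'unsure'."""
    # Axis 1: True if the primary purpose is harmful,
    # False if there is substantial legitimate use.
    primary_purpose_harmful: Optional[bool]
    # Axis 2: True if the risk is inherent in the code,
    # False if it depends on how the code is used.
    risk_inherent: Optional[bool]


def decide(assessment: Assessment) -> Decision:
    # Assumption: refuse only when BOTH axes clearly point to harm.
    # Mixed or uncertain answers fall through to generating with warnings.
    if (assessment.primary_purpose_harmful is True
            and assessment.risk_inherent is True):
        return Decision.REFUSE
    return Decision.GENERATE_WITH_WARNINGS


# Illustrative classifications taken from the examples in this report.
phishing_kit = Assessment(primary_purpose_harmful=True, risk_inherent=True)
remote_admin_tool = Assessment(primary_purpose_harmful=False, risk_inherent=False)
unclear_request = Assessment(primary_purpose_harmful=None, risk_inherent=True)

print(decide(phishing_kit).value)       # refuse
print(decide(remote_admin_tool).value)  # generate_with_warnings
print(decide(unclear_request).value)    # generate_with_warnings
```

Keeping the refusal branch narrow means a request is refused only when both answers are unambiguous, which matches the report's preference for warnings over refusal in unclear cases.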
Journey Context:
This two-axis model resolves the hardest judgment calls. A remote administration tool is conditionally risky: the same capability is how IT departments routinely manage infrastructure. A phishing kit is inherently harmful: its primary purpose is deception. The Anthropic Acceptable Use Policy prohibits content that 'facilitates wrongdoing' but does not prohibit dual-use tools with legitimate applications. The practical heuristic: if the code would be at home in a sysadmin's toolkit or at a security conference, generate it. If it would only appear in a criminal marketplace, refuse. The safety line is drawn by primary purpose, not by capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:14:53.607008+00:00 — report_created — created