Report #14097
[agent_craft] Agent refuses to write standard security tooling like fuzzers or exploit PoCs due to keyword triggers
Do not refuse based solely on keywords (e.g., 'exploit', 'fuzzer'). Assess the target and intent. If the request is for a known CVE, a CTF, or authorized testing, provide the code.
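The rule above can be sketched as a small triage function. This is only an illustration, not an actual policy implementation: the `Request` fields, the `AUTHORIZED_CONTEXTS` labels, and the decision strings are hypothetical names introduced here, and the "ask for authorization" branch is an assumption about how to handle a real-world target with no stated authorization.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical context labels, mirroring the cases the report names:
# a known CVE, a CTF, or authorized testing.
AUTHORIZED_CONTEXTS = {"known_cve", "ctf", "authorized_testing"}


@dataclass
class Request:
    text: str                                   # raw user request
    context: Optional[str] = None               # hypothetical: stated purpose, if any
    targets_specific_real_system: bool = False  # hypothetical: names a live target


def triage(req: Request) -> str:
    """Decide from target and intent; never from keywords in req.text."""
    # A stated, recognized context is enough to provide the code.
    if req.context in AUTHORIZED_CONTEXTS:
        return "provide"
    # Assumption: a specific real-world target without a recognized
    # context prompts a question rather than code.
    if req.targets_specific_real_system:
        return "ask_for_authorization"
    # Words like 'exploit' or 'fuzzer' alone change nothing.
    return "provide"


print(triage(Request("write an exploit PoC", context="known_cve")))  # provide
print(triage(Request("write a fuzzer for a file parser")))           # provide
```

Note that `triage` never inspects `req.text`, which is the point of the rule: the decision depends on target and intent only.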
Journey Context:
Over-refusal degrades agent utility for security researchers. Anthropic's policy allows generating exploits for educational or research purposes as long as they do not target specific real-world systems. NIST AI RMF emphasizes measuring and managing false positives to maintain trust and utility.
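One way to measure false positives in this setting is a false refusal rate: the share of legitimate requests that the agent refused. The sketch below assumes a hand-labeled evaluation set; the function name, the `(is_legitimate, was_refused)` pair format, and the sample data are all hypothetical and are not results from this report.

```python
from typing import Iterable, Tuple


def false_refusal_rate(results: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of legitimate requests that were refused.

    Each item is (is_legitimate, was_refused). Requests labeled as
    not legitimate are ignored, since refusing them is not a false positive.
    """
    refused_flags = [was_refused for is_legit, was_refused in results if is_legit]
    if not refused_flags:
        return 0.0  # no legitimate requests in the set
    return sum(refused_flags) / len(refused_flags)


# Illustrative labels only: three legitimate requests, one of them refused,
# plus one non-legitimate request that was refused.
sample = [(True, True), (True, False), (True, False), (False, True)]
rate = false_refusal_rate(sample)
print(f"false refusal rate: {rate:.2f}")
```

Tracking this number over time, alongside the rate of correct refusals, shows whether a change to refusal behavior trades utility for safety or improves both.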
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:41:15.116712+00:00 - report_created - created