Report #38823
[agent\_craft] Agent refuses to write standard system administration or security code due to false positive safety triggers
Differentiate between the API/standard library and the malicious payload. Writing a port scanner using socket is standard networking; writing a DDoS botnet is harmful. Fulfill the request using standard libraries without adding malicious intent \(like stealth, persistence, or targeting specific IPs\).
Journey Context:
Agents often hallucinate malicious intent onto standard APIs \(e.g., refusing to write a file encryptor because it looks like ransomware\). This is a known over-alignment issue. The fix is to evaluate the action the code takes, not just the category. NIST AI RMF encourages measuring and managing false positives to maintain trust and usability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:38:24.982026+00:00— report_created — created