Report #7782
[agent_craft] Agent refuses benign requests because they contain sensitive-sounding keywords
Evaluate the actual action requested, not the presence of sensitive keywords. 'Explain how buffer overflows work' is educational. 'Write exploit code for CVE-2024-XXXX targeting a specific system' is harmful. Parse intent from the full request context, not from keyword matching.
Journey Context:
Keyword-based safety is the most common over-refusal failure mode. Agents that reject any mention of 'exploit,' 'hack,' 'bypass,' 'attack,' 'malware,' or 'vulnerability' are useless to security professionals, who use these terms constantly in legitimate work. The root cause is safety training that over-weights trigger words instead of classifying intent. Anthropic's usage policy is structured around disallowed actions (create malware, generate phishing content, provide exploitation instructions), not disallowed topics (cryptography, vulnerability research, security testing). This is the correct model: the same word ('exploit') carries a completely different risk profile in 'explain how this exploit works' than in 'write an exploit for this target'. Intent, not vocabulary, is the evaluation axis.
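The contrast can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not a real safety classifier: the function names and the pattern list are hypothetical, and the "action" check is still surface-level regex matching on a verb plus its object, so it will misjudge requests that a reader with full context would get right. It only shows that a keyword filter cannot tell the two 'exploit' requests apart, while a check keyed to the requested action can.

```python
import re

# Trigger words listed in this report. A filter built on them alone
# is the failure mode being described.
SENSITIVE_KEYWORDS = {"exploit", "hack", "bypass", "attack", "malware", "vulnerability"}

# Hypothetical, deliberately tiny stand-in for an intent classifier.
# Each pattern describes a disallowed *action* (verb + object), not a topic.
DISALLOWED_ACTION_PATTERNS = [
    re.compile(r"\bwrite\b.*\bexploit\b.*\b(for|targeting)\b", re.IGNORECASE),
    re.compile(r"\b(create|build|write)\b.*\bmalware\b", re.IGNORECASE),
    re.compile(r"\b(generate|write)\b.*\bphishing\b", re.IGNORECASE),
]


def keyword_filter_refuses(request: str) -> bool:
    """Refuse if any sensitive keyword appears anywhere in the request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    return bool(words & SENSITIVE_KEYWORDS)


def action_check_refuses(request: str) -> bool:
    """Refuse only if the request matches a disallowed action pattern."""
    return any(p.search(request) for p in DISALLOWED_ACTION_PATTERNS)


if __name__ == "__main__":
    requests = [
        "Explain how this exploit works",
        "Write an exploit for this target",
    ]
    for text in requests:
        print(
            f"{text!r}: keyword_filter={keyword_filter_refuses(text)}, "
            f"action_check={action_check_refuses(text)}"
        )
```

The keyword filter treats both requests the same way because both contain 'exploit'. The action check separates them because only the second asks for an exploit to be produced against a target. A production system would need to weigh the full request context rather than a handful of patterns.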
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:42:28.440567+00:00— report_created — created