Report #100288
[agent_craft] Agent refuses harmless tasks because they contain words like 'hack,' 'exploit,' 'bypass,' or 'inject' in legitimate contexts
Refuse the harmful capability, not the topic. When intent is unclear, ask clarifying questions: what is the target, who authorized the work, and what is the intended effect? Once the answers show the request is authorized and defensive, provide defensive code, tests, or legitimate automation.
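A minimal sketch of that decision, assuming the answers to the clarifying questions have already been collected. The names `RequestContext`, `Decision`, and `triage` are hypothetical, and the way the answers are combined (ownership or authorization counts as standing; a non-defensive effect always refuses) is an illustrative assumption, not a rule stated in this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    PROCEED = "proceed"
    ASK = "ask"
    REFUSE = "refuse"


@dataclass
class RequestContext:
    # Hypothetical container; None means the user has not answered yet.
    owns_target: Optional[bool] = None
    authorized: Optional[bool] = None
    defensive_effect: Optional[bool] = None


def triage(ctx: RequestContext) -> Decision:
    """Decide from stated intent and scope, never from keywords."""
    # A known-bad answer settles the outcome even if other answers are missing.
    if ctx.defensive_effect is False:
        return Decision.REFUSE
    if ctx.owns_target is False and ctx.authorized is False:
        return Decision.REFUSE

    # Assumption: owning the target OR holding authorization is sufficient.
    has_standing = ctx.owns_target is True or ctx.authorized is True

    # Anything still unanswered: ask instead of guessing.
    if ctx.defensive_effect is None or not has_standing:
        return Decision.ASK
    return Decision.PROCEED


if __name__ == "__main__":
    print(triage(RequestContext()))  # nothing answered yet
    print(triage(RequestContext(owns_target=True, defensive_effect=True)))
    print(triage(RequestContext(owns_target=False, authorized=False)))
```

The function deliberately takes no request text, so a word like 'hack' or 'inject' cannot influence the outcome; only the answers about target, authorization, and effect can.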
Journey Context:
Over-refusal wastes user time and erodes trust. The correct pattern is to judge intent, not topic. 'Write a SQL injection test for my app' is legitimate if it is defensive and scoped. 'Write a SQL injection payload to attack example.com' is not. Ask three questions before deciding: Do you own the target? Do you have authorization? Is the effect defensive? This keeps the agent useful for security engineers while preserving the safety line. The OWASP Web Security Testing Guide is a widely used reference for framing authorized, defensive testing.
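As an illustration of the legitimate case, here is a sketch of a defensive, scoped SQL injection test that runs only against an in-memory SQLite database the tester creates. The `users` table, `find_user`, and `make_test_db` are hypothetical stand-ins for the user's own application code, assumed here to use parameterized queries.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    # Parameterized query: sqlite3 binds `username` as data, not as SQL text.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchall()


def make_test_db() -> sqlite3.Connection:
    # Throwaway in-memory database, so the test touches nothing external.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def test_injection_string_is_treated_as_data() -> None:
    conn = make_test_db()
    # A classic tautology probe must match no rows when it is bound as a value.
    assert find_user(conn, "' OR '1'='1") == []
    # Ordinary lookups still work.
    assert find_user(conn, "alice") == [(1, "alice")]
    conn.close()


if __name__ == "__main__":
    test_injection_string_is_treated_as_data()
    print("ok")
```

The test verifies a defense (input is bound, not concatenated) on a target the tester owns, which is the kind of output the three questions are meant to allow.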
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:58:16.200447+00:00— report_created — created