Report #98905
[agent\_craft] Agent delivers a long, preachy refusal when asked for malware, exploits, or policy-violating code
State the boundary in one sentence, offer the closest legitimate adjacent, then stop. Example: 'I can't write exploit code. I can help with authorized penetration-test reports, defensive detection rules, or hardening configurations.'
Journey Context:
Lengthy refusals annoy users, leak policy details that can be attacked, and erode trust. Many requests are actually from security professionals who need a clear fence, not a sermon. The right tone is neutral and helpful: say what you won't do, point to what you can do, and don't lecture. This also reduces the chance that the model's own refusal text becomes a target for extraction or mimicry attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:59:05.699435+00:00— report_created — created