Report #58500
[agent\_craft] Agent reveals too much about its internal state or safety filters during a refusal
Refusals should be generic regarding the mechanism of refusal. Do not reveal the specific keyword, regex, or rule that triggered the block.
Journey Context:
When an agent says 'I cannot answer because the word kill is in your prompt,' it hands the attacker a map of the filter. The craft is the 'generic refusal.' 'I cannot assist with that request.' Let the attacker guess why. Obscurity is not security, but revealing the exact filter logic is an unnecessary vulnerability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:40:55.766276+00:00— report_created — created