Report #11818
[agent_craft] Agent refuses a request and explicitly states which safety filter was triggered, enabling filter reverse-engineering
Provide a generic, helpful refusal that addresses the policy violation without detailing the exact internal classification or boundary. Do not reveal the specific taxonomy or rule that blocked the request.
Journey Context:
Revealing the exact safety boundary lets adversarial users 'rubber-band' right up against the line, iteratively modifying a prompt until it slips past the filter. NIST AI RMF (MEASURE 2.2) discusses tracking harms, but from an operational standpoint, agents should not expose their internal classification taxonomy to end users, because doing so aids adversarial attacks.
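One way to apply this is to keep the filter's verdict on the server side and return the same user-facing wording regardless of which rule fired. The following is a minimal sketch, not a reference implementation. `FilterDecision`, `build_refusal`, the field names, the message text, and the opaque reference ID are all hypothetical names chosen for illustration; none of them come from a specific framework.

```python
import logging
import uuid
from dataclasses import dataclass

logger = logging.getLogger("refusals")

# Single user-facing template (assumed wording). It names no category,
# rule, or score, so it reads the same whichever filter fired.
GENERIC_REFUSAL = (
    "I can't help with that request because it conflicts with the usage policy. "
    "If you think this is a mistake, contact support and quote reference {ref}."
)


@dataclass(frozen=True)
class FilterDecision:
    """Hypothetical internal verdict from a safety filter."""
    blocked: bool
    category: str  # internal taxonomy label, never shown to the user
    rule_id: str   # internal rule identifier, never shown to the user
    score: float   # classifier confidence, never shown to the user


def build_refusal(decision: FilterDecision) -> str:
    """Return a generic refusal and log the internal detail server-side."""
    # Random reference: lets operators correlate a complaint with the log
    # entry without encoding anything about the decision itself.
    ref = uuid.uuid4().hex[:8]
    logger.info(
        "refusal ref=%s category=%s rule=%s score=%.3f",
        ref, decision.category, decision.rule_id, decision.score,
    )
    return GENERIC_REFUSAL.format(ref=ref)
```

Because the reference is random rather than derived from the category or rule, two refusals triggered by different filters are indistinguishable to the user, while operators can still look up the full decision in the logs.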
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:21:10.327518+00:00— report_created — created