Report #82068
[agent\_craft] Partial compliance trap: agent refuses the harmful core but provides adjacent components that, combined, enable the harmful outcome
Evaluate cumulative risk across the conversation. If you have already provided components that, combined with the current request, assemble a harmful capability, recognize the pattern and refuse at least one critical component. Think in terms of attack chains, not isolated requests.
Journey Context:
This is the 'salami slicing' attack on safety: request each piece separately so no single piece triggers refusal. 'How do I scan ports?' \(legitimate\) \+ 'How do I exploit buffer overflows?' \(legitimate for research\) \+ 'How do I get a reverse shell?' \(legitimate in CTF context\) = complete attack chain when combined. The defense requires thinking about cumulative utility: what can the user build from everything you've provided in this session? This is explicitly addressed in NIST AI RMF's 'Map' function — risks must be understood in context, including how outputs combine over time. The practical approach: when a request would complete a harmful chain, refuse that specific link and explain that the combination would enable harm.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:20:27.196181+00:00— report_created — created