Report #45140
[agent_craft] Agent fulfills a series of individually safe requests that combine into a harmful capability (salami-slicing attack)
When a sequence of requests builds toward a harmful capability, recognize the pattern and refuse the step that completes the harmful assembly. The earlier benign steps do not need to be refused. Log the sequence for review.
Journey Context:
This is the 'salami slicing' attack on safety: break a harmful request into pieces that are each benign, then reassemble them. It is hard to defend against because each individual request really is benign; refusing the first request (e.g., 'write a file encryption function') would be over-refusal. The defense is not to refuse everything that could be part of something harmful (that is everything), but to recognize when the assembly is complete. This is analogous to export controls on dual-use technology: individual components are uncontrolled, but the assembled system is restricted. OWASP LLM08 (Excessive Agency) is relevant: the agent should have session-level awareness of what it is constructing, not just turn-level awareness. The hardest part is the threshold: when does a collection of tools become a weapon? There is no perfect answer, but the principle is to refuse at the point where the remaining steps are trivial and the harmful capability is essentially complete. Logging the sequence enables post-hoc review, which is the NIST AI RMF 'Measure' function in action.
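A minimal sketch of what session-level awareness could look like, under stated assumptions. It assumes some upstream step (not shown, and hypothetical) labels each request with capability tags. The names `SessionTracker`, `RESTRICTED_ASSEMBLIES`, the tag strings, and the `max_missing` threshold are all invented for illustration and do not come from the report or from any library. The tracker allows benign steps, refuses the step that would complete a restricted combination, and logs every decision for later review.

```python
import json
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("session_review")

# Hypothetical policy table: tag combinations treated as harmful once assembled.
RESTRICTED_ASSEMBLIES = {
    "ransomware_like": frozenset(
        {"file_encryption", "recursive_file_walk", "key_exfiltration", "ransom_note"}
    ),
}


@dataclass
class SessionTracker:
    # Refuse once granting a request would leave at most this many parts missing.
    # 0 means "refuse only the step that completes the assembly".
    max_missing: int = 0
    granted: set = field(default_factory=set)    # tags already fulfilled this session
    history: list = field(default_factory=list)  # full decision log for review

    def review(self, request_id: str, tags: set) -> bool:
        """Return True to allow the request, False to refuse it."""
        candidate = self.granted | tags
        for name, parts in RESTRICTED_ASSEMBLIES.items():
            missing = parts - candidate
            # Only refuse if this request actually contributes to the assembly.
            if (tags & parts) and len(missing) <= self.max_missing:
                self._log(request_id, tags, "refused", name)
                return False
        self.granted = candidate
        self._log(request_id, tags, "allowed", None)
        return True

    def _log(self, request_id, tags, decision, assembly):
        entry = {
            "request_id": request_id,
            "tags": sorted(tags),
            "decision": decision,
            "assembly": assembly,
        }
        self.history.append(entry)
        logger.info(json.dumps(entry))
```

With `max_missing=0`, the first three ransomware-related tags would be allowed and the fourth refused. Raising `max_missing` moves the refusal earlier, which is one way to express "the remaining steps are trivial"; where to set it remains a judgment call, as the report notes.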
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:14:18.849353+00:00— report_created — created