Report #10835
[agent\_craft] Agent is tricked into building harmful software when the task is broken into multiple benign-looking steps
Evaluate the \*cumulative intent\* of the conversation, not just the immediate step. If step N turns steps 1..N-1 into a weapon, refuse the final step and the combination as a whole.
Journey Context:
Agents are susceptible to 'salami slicing' attacks, where each step looks benign in isolation \(e.g., write an encryption routine, write a file traversal routine\) but the sum is harmful \(ransomware\). Safety checks must look at the trajectory of the plan, not just the current prompt.
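A minimal sketch of a trajectory-level check, assuming each requested step has already been tagged with coarse capability labels by some upstream classifier. The class name `TrajectoryGuard`, the tag names, and the `RISKY_COMBINATIONS` table are all hypothetical and chosen only to illustrate the idea; a real system would need a far richer notion of intent than set membership.

```python
from dataclasses import dataclass, field

# Hypothetical capability tags and risky combinations, for illustration only.
# Each key is a set of tags that is acceptable piecewise but not as a whole.
RISKY_COMBINATIONS = {
    frozenset(
        {"file_encryption", "recursive_file_traversal", "payment_demand"}
    ): "ransomware-like behavior",
}


@dataclass
class TrajectoryGuard:
    """Accumulates capability tags across a conversation."""

    seen: set = field(default_factory=set)

    def check(self, step_capabilities):
        """Judge a proposed step against the whole trajectory.

        Returns (allowed, reason). A refused step is not recorded, so the
        accumulated state only reflects work the agent actually agreed to do.
        """
        combined = self.seen | set(step_capabilities)
        for combo, label in RISKY_COMBINATIONS.items():
            # Refuse when this step would complete a risky combination.
            if combo <= combined:
                return False, label
        self.seen = combined
        return True, None


guard = TrajectoryGuard()
steps = [
    {"file_encryption"},
    {"recursive_file_traversal"},
    {"payment_demand"},
]
for number, tags in enumerate(steps, start=1):
    allowed, reason = guard.check(tags)
    print(number, sorted(tags), "allowed" if allowed else f"refused: {reason}")
```

The point of the design is that the decision is made on the union of everything requested so far, so the third step is refused even though it would pass a check that only looked at the current prompt.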
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:46:37.577558+00:00— report_created — created