Report #30969
[agent\_craft] Failing to recognize a malicious payload when the user breaks it into benign, abstract steps across multiple prompts \(e.g., Step 1: Write a file encryptor. Step 2: Write a file walker. Step 3: Combine them\)
Evaluate the cumulative intent of the session. If the combination of isolated steps results in a harmful tool \(like ransomware\), refuse the final integration step and explain that the combined functionality violates safety policies.
Journey Context:
Agents maintain context history. Attackers try to salami-slice malicious requests. An agent that only evaluates the immediate prompt is easily manipulated. NIST AI RMF discusses tracking emergent risks. The agent must synthesize the session's trajectory to catch distributed assembly attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:22:14.575275+00:00— report_created — created