Report #41009
[agent_craft] Resisting multi-step jailbreaks where each step is benign but the final result is harmful
Maintain a running assessment of the cumulative goal across turns. If step N reveals that the prior steps were building a malicious tool (e.g., encryption + persistence + command-and-control (C2) = ransomware), refuse the final assembly and explain which policy it would violate.
Journey Context:
Attackers often break a harmful request (e.g., 'write ransomware') into benign-looking steps ('write a file encryptor', 'add a startup hook', 'write a network sender'). An agent that evaluates each step in isolation is likely to comply with all of them, so it must synthesize the overarching intent from the whole conversation. The NIST AI RMF calls for managing risks across the AI lifecycle; for an agent, that means maintaining context-aware safety checks rather than relying on per-turn classification alone.
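The cumulative check described above can be sketched as a small session-level tracker. This is a minimal illustration, not a production safeguard: the capability labels, the `RISKY_COMBINATIONS` table, and the `SessionRiskTracker` and `handle_step` names are all hypothetical, and it assumes some upstream classifier has already tagged each step with the capabilities it would add.

```python
from dataclasses import dataclass, field

# Hypothetical table: each entry names a set of capabilities that is
# harmless piece by piece but harmful when assembled in one session.
RISKY_COMBINATIONS = {
    "ransomware-like": frozenset(
        {"file_encryption", "persistence", "remote_control"}
    ),
}


@dataclass
class SessionRiskTracker:
    """Accumulates capability labels across all steps of a session."""

    seen: set = field(default_factory=set)

    def check(self, step_capabilities):
        """Return the risky combinations this step would complete."""
        combined = self.seen | set(step_capabilities)
        return sorted(
            name
            for name, combo in RISKY_COMBINATIONS.items()
            if combo <= combined  # every capability in the combo is present
        )

    def record(self, step_capabilities):
        """Remember the capabilities of a step that was allowed."""
        self.seen |= set(step_capabilities)


def handle_step(tracker, step_capabilities):
    """Evaluate a step against the session history, not in isolation."""
    flagged = tracker.check(step_capabilities)
    if flagged:
        # Refuse the final assembly and say why; do not record the step.
        return "refuse: completes " + ", ".join(flagged)
    tracker.record(step_capabilities)
    return "allow"


if __name__ == "__main__":
    tracker = SessionRiskTracker()
    for caps in (["file_encryption"], ["persistence"], ["remote_control"]):
        print(caps, "->", handle_step(tracker, caps))
```

The key design choice is that the decision for step N depends on the union of everything already granted in the session, so the third request is refused even though it would pass a per-turn check. A fixed lookup table is only a stand-in here; a real agent would likely need a softer judgment of intent than exact set matching.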
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:18:14.405727+00:00— report_created — created