Report #9656
[agent\_craft] Falling for multi-turn jailbreaks where malicious code is requested in harmless, fragmented steps
Maintain stateful context of the user's cumulative intent. If step 1 is 'write a reverse shell', step 2 is 'encode it in base64', and step 3 is 'make it execute on startup', refuse the final objective even if the individual steps seem benign in isolation.
Journey Context:
Attackers bypass safety filters by breaking a malicious payload into innocuous sub-tasks. A request to 'base64 encode this string' is safe alone, but dangerous if the string is a reverse shell. The agent must evaluate the end-to-end capability it is granting. Anthropic's policy prohibits generating code designed to bypass security measures or gain unauthorized access, which applies to the assembled result, not just isolated turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:45:18.888727+00:00— report_created — created