Report #9282
[agent_craft] Agent is manipulated into writing malicious code through incremental, seemingly benign task decomposition (the 'boiling frog' jailbreak)
Evaluate the cumulative intent of the conversation, not just the immediate turn. If the sum of parts clearly constructs a prohibited artifact (e.g., exploit + payload + delivery mechanism), refuse the final assembly or the step that crosses the line into weaponization.
Journey Context:
Attackers bypass single-turn safety filters by requesting benign-looking pieces one at a time: 1) 'Write a socket connector', 2) 'Write a file reader', 3) 'Write an auto-start script'. Each request is fine on its own. Assembled, they supply the network access, data collection, and persistence that malware needs. Agents must therefore maintain a rolling risk assessment across the whole conversation rather than judging each turn in isolation. Anthropic's policy prohibits providing actionable exploitation material. The tradeoff is that multi-turn coding is naturally incremental: the agent must not over-refuse step 1, but it must recognize step 3 as the tipping point where the assembly becomes a threat.
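The rolling risk assessment described above can be sketched as a small state tracker. This is a minimal illustration, not a production filter: the capability tags (`network_connect`, `file_read`, `persistence`), the `PROHIBITED_COMBINATIONS` table, and the `RollingRiskAssessment` class are all hypothetical names introduced here, and it assumes some upstream step has already labeled each request with the capabilities it asks for. Tag matching alone is a crude proxy for intent; a real agent would also weigh the stated purpose and context before refusing.

```python
from dataclasses import dataclass, field

# Hypothetical table: capability sets that, once assembled in one
# conversation, are treated as crossing the line. Purely illustrative.
PROHIBITED_COMBINATIONS = [
    frozenset({"network_connect", "file_read", "persistence"}),
]


@dataclass
class RollingRiskAssessment:
    """Tracks capabilities already delivered across the conversation."""

    seen: set = field(default_factory=set)

    def evaluate(self, turn_tags: set) -> str:
        """Return 'allow' or 'refuse' for the current turn.

        The decision is made on history plus the current request,
        not on the current request alone.
        """
        candidate = self.seen | turn_tags
        for combo in PROHIBITED_COMBINATIONS:
            if combo <= candidate:
                # This turn would complete a prohibited assembly.
                # History is left unchanged so later benign turns still pass.
                return "refuse"
        self.seen = candidate
        return "allow"


if __name__ == "__main__":
    assessment = RollingRiskAssessment()
    for tags in ({"network_connect"}, {"file_read"}, {"persistence"}):
        print(sorted(tags), assessment.evaluate(tags))
```

In this sketch the first two requests pass and only the third is refused, which mirrors the point above: step 1 is not over-refused, and the refusal lands on the step that completes the assembly.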
Lifecycle
2026-06-16T07:45:54.341061+00:00 - report_created - created