Report #26900
[agent_craft] User incrementally asks for malicious code by breaking it into harmless-looking steps
Evaluate the cumulative intent of the conversation, not just the immediate turn. If the trajectory clearly points to assembling malware, refuse the final assembly step and explain the cumulative violation.
Journey Context:
Attackers use multi-turn jailbreaks (e.g., Step 1: socket connection, Step 2: encryption, Step 3: file traversal, Step 4: ransomware assembly). If the agent evaluates each turn in isolation, it can be led one innocuous step at a time into producing the full payload. Tracking conversation state and cumulative intent is crucial for robust safety.
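The cumulative check described above can be sketched as a small per-conversation tracker. This is a minimal illustration under stated assumptions, not a production guardrail: `CAPABILITY_KEYWORDS`, `RANSOMWARE_PATTERN`, `Verdict` and `ConversationIntentTracker` are hypothetical names, and naive keyword matching stands in for a real per-turn intent classifier. The key idea is that capabilities seen in earlier turns persist in `seen`, so the final "combine it all" request is judged against the whole trajectory rather than on its own.

```python
from dataclasses import dataclass, field

# Hypothetical keyword map; a real agent would use a trained intent classifier.
CAPABILITY_KEYWORDS = {
    "network": ("socket", "connect to server"),
    "encryption": ("encrypt",),
    "file_traversal": ("traverse files", "walk directory", "os.walk"),
    "assembly": ("combine", "put it all together", "assemble"),
}

# Hypothetical high-risk combination modeled on Steps 1-3 of the report.
RANSOMWARE_PATTERN = frozenset({"network", "encryption", "file_traversal"})


@dataclass
class Verdict:
    allowed: bool
    reason: str


@dataclass
class ConversationIntentTracker:
    # Capabilities requested so far in this conversation (the tracked state).
    seen: set[str] = field(default_factory=set)

    def classify(self, message: str) -> set[str]:
        """Tag a single turn with the capability categories it touches."""
        text = message.lower()
        return {
            cap
            for cap, words in CAPABILITY_KEYWORDS.items()
            if any(w in text for w in words)
        }

    def evaluate(self, message: str) -> Verdict:
        """Judge the current turn against the accumulated trajectory."""
        caps = self.classify(message)
        # Record component capabilities; "assembly" is the trigger, not a component.
        self.seen |= caps - {"assembly"}
        if "assembly" in caps and RANSOMWARE_PATTERN <= self.seen:
            return Verdict(
                allowed=False,
                reason=(
                    "Refusing the assembly step: combined with earlier requests "
                    f"for {sorted(self.seen)}, this would produce a ransomware-like program."
                ),
            )
        return Verdict(allowed=True, reason="No cumulative risk pattern detected.")
```

Each of the first three turns in the report's example passes on its own; only the fourth, which asks to combine them, is refused, and its `reason` names the earlier components so the refusal can explain the cumulative violation. A benign "combine these two CSV files" request in a fresh conversation is still allowed, because the risky components were never accumulated.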
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:33:10.062918+00:00 | report_created | created