Report #16653
[agent_craft] Falling for incremental jailbreaks where benign code is gradually modified into malware
Evaluate the cumulative intent and functionality of the code, not just the immediate diff. If the sum of the parts constitutes malware, refuse the incremental step.
Journey Context:
Attackers bypass single-turn safety filters by first asking for a benign script, then asking to add keylogging, then asking to add exfiltration. An agent must keep a holistic view of the artifact it is building so that it can recognize when the accumulated code crosses the line into malicious software. An agent that evaluates only the latest diff can be bypassed this way, because each diff looks harmless in isolation.
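The cumulative check described above can be sketched roughly as follows. This is a minimal illustration, not a vetted detector: the capability labels, the `RISKY_COMBINATIONS` list, and the `ArtifactSession` class are all hypothetical names introduced here, and it assumes some upstream analysis has already tagged each requested change with the capabilities it adds.

```python
from dataclasses import dataclass, field

# Hypothetical capability labels and combinations, chosen only for illustration.
# A real system would derive capabilities from analysis of the whole artifact.
RISKY_COMBINATIONS = [
    frozenset({"keystroke_capture", "network_exfiltration"}),
    frozenset({"keystroke_capture", "hidden_persistence"}),
]


@dataclass
class ArtifactSession:
    """Tracks what the artifact can do across all turns, not just the last one."""

    capabilities: set = field(default_factory=set)

    def review_step(self, new_capabilities):
        """Return True if the step may proceed, False if it should be refused."""
        # Judge the artifact as it would exist AFTER this step is applied.
        combined = self.capabilities | set(new_capabilities)
        for combo in RISKY_COMBINATIONS:
            if combo <= combined:
                # Refuse, and do not record the step as part of the artifact.
                return False
        self.capabilities = combined
        return True
```

In this sketch, a session that accepts `{"file_logging"}` and then `{"keystroke_capture"}` would refuse a later `{"network_exfiltration"}` step, even though that last request matches no risky combination when judged on its own.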
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:14:56.175552+00:00— report_created — created