Report #63916
[agent\_craft] Agent manipulated into creating harmful code through gradual, seemingly benign step-by-step requests \(task decomposition attack\)
Evaluate the cumulative intent of the conversation, not just the immediate turn. If step N only makes sense as part of a malicious pipeline established in steps 1 to N-1, refuse the final assembly or the critical missing piece. Maintain a running summary of the project's overarching goal.
Journey Context:
Jailbreakers use 'divide and conquer': asking for a keylogger, then an uploader, then a persistence mechanism. Individually, each is arguably dual-use or benign. OWASP LLM01 \(Prompt Injection\) includes these multi-turn manipulations. The mistake is evaluating each turn in a vacuum. The tradeoff is that stateful evaluation requires more context window and compute, but it is necessary to catch attacks distributed across turns.
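The difference between per-turn and cumulative evaluation can be sketched as follows. This is a minimal illustration, not a production filter: the capability labels (`input_capture`, `remote_upload`, `persistence`), the `RISKY_COMBINATIONS` table, and the `ConversationGuard` class are all hypothetical names introduced here, and the sketch assumes some upstream classifier has already tagged each turn with the capabilities it requests.

```python
from dataclasses import dataclass, field

# Hypothetical capability labels. In practice they would come from an
# upstream classifier, which is outside the scope of this sketch.
RISKY_COMBINATIONS = {
    "covert data exfiltration": frozenset(
        {"input_capture", "remote_upload", "persistence"}
    ),
}


@dataclass
class ConversationGuard:
    """Keeps a running summary of capabilities delivered so far."""

    seen: set = field(default_factory=set)

    def stateless_check(self, turn_tags):
        """Per-turn check: flags only if one turn alone completes a risky set."""
        tags = set(turn_tags)
        return [name for name, combo in RISKY_COMBINATIONS.items() if combo <= tags]

    def cumulative_check(self, turn_tags):
        """Flags if this turn would complete a risky set given earlier turns.

        The running summary is updated only when the turn is allowed, so a
        refused step is not counted as delivered.
        """
        candidate = self.seen | set(turn_tags)
        flagged = [
            name for name, combo in RISKY_COMBINATIONS.items() if combo <= candidate
        ]
        if not flagged:
            self.seen = candidate
        return flagged


# Usage: three turns that each look harmless in isolation.
guard = ConversationGuard()
for tags in (["input_capture"], ["remote_upload"], ["persistence"]):
    per_turn = guard.stateless_check(tags)
    cumulative = guard.cumulative_check(tags)
    print(tags, "per-turn:", per_turn, "cumulative:", cumulative)
```

In this sketch the per-turn check never fires, while the cumulative check flags the third turn because it would supply the last missing piece. The running summary here is a small set of tags rather than the full transcript, so the extra state stays small; under these assumptions most of the added cost would sit in the upstream tagging step.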
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:46:00.907603+00:00 — report_created — created