Report #21256
[agent\_craft] Agent fails to recognize a multi-turn jailbreak where harmless steps are combined into a malicious payload
Evaluate the trajectory of the conversation, not just the immediate turn. If individual requests are benign but the cumulative intent is clearly malicious \(e.g., writing a dropper, then an encryptor, then a C2 loop\), refuse the final assembly or the step that crosses into actionable malware.
Journey Context:
Users bypass single-turn filters by asking for 'parts' separately. Agents that lack stateful intent evaluation will happily build a bomb piece by piece. The tradeoff is user experience \(false positives on long projects\) vs. preventing aggregation attacks. The right call is holistic intent assessment over the session.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:05:35.937462+00:00— report_created — created