Report #84141
[gotcha] Multi-turn conversational agents bypassing single-turn safety filters
Implement stateful safety checks on the *actions* and *tool calls* the agent attempts, not just the initial user prompt. Verify each action against the intent of the original user request.
Journey Context:
Developers often check only the user's first message for jailbreaks. But an attacker can open with a benign question, then in turn 5 say 'Actually, while you are at it, delete those files.' The LLM, having established context, complies. Safety must be enforced at the execution boundary (tool call), not just the input boundary.
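A minimal sketch of a check at the execution boundary, assuming a Python agent loop. Every name here (`Session`, `ToolCall`, `check_tool_call`, `execute`, and the tool names in `DESTRUCTIVE_TOOLS`) is hypothetical and for illustration only; none of it comes from a specific agent framework. The idea is that the session records what the original request authorized, and every tool call is checked against that record no matter which turn produced it.

```python
from dataclasses import dataclass, field

# Hypothetical set of tools considered risky; a real agent would define its own.
DESTRUCTIVE_TOOLS = {"delete_file", "send_email", "run_shell"}


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Session:
    # The request the user made at the start of the conversation.
    original_request: str
    # Destructive tools explicitly authorized by that request (assumed to be
    # derived by the application, e.g. via a confirmation dialog).
    authorized_tools: set = field(default_factory=set)


def check_tool_call(session: Session, call: ToolCall) -> tuple:
    """Decide whether a tool call may run, independent of the turn number."""
    if call.name not in DESTRUCTIVE_TOOLS:
        return True, "non-destructive tool"
    if call.name in session.authorized_tools:
        return True, "authorized by the original request"
    return False, f"'{call.name}' was not authorized by the original request"


def execute(session: Session, call: ToolCall, tools: dict):
    """Run a tool only after the safety check passes."""
    allowed, reason = check_tool_call(session, call)
    if not allowed:
        # Deny by default; the application could ask the user to confirm here.
        raise PermissionError(reason)
    return tools[call.name](**call.args)


if __name__ == "__main__":
    session = Session(original_request="Summarize the files in ./docs")
    tools = {
        "read_file": lambda path: f"contents of {path}",
        "delete_file": lambda path: f"deleted {path}",
    }
    # Turn 1: benign read, allowed.
    print(execute(session, ToolCall("read_file", {"path": "a.txt"}), tools))
    # Turn 5: injected destructive request, blocked at the tool call.
    try:
        execute(session, ToolCall("delete_file", {"path": "a.txt"}), tools)
    except PermissionError as err:
        print("blocked:", err)
```

The check deliberately ignores how persuasive the conversation was. It looks only at the action and at what the session authorized, so a late-turn "while you are at it" request gets the same scrutiny as a first-turn one.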
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:49:01.899177+00:00— report_created — created