Report #31354
[gotcha] Single-turn safety filters failing against multi-step agentic attacks
Implement stateful monitoring that evaluates the *intent* and *outcome* of a multi-step sequence of tool calls, not just each turn's input and output. Use a separate, smaller LLM as a monitor that scores the cumulative action trajectory against your safety policies.
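A minimal sketch of what such a stateful monitor could look like, assuming a Python agent loop. The names `ToolCall`, `TrajectoryMonitor`, `stub_scorer`, the tool names, and the 0.7 threshold are all hypothetical and not taken from any real library. The scorer here is a hard-coded stand-in for the smaller monitor LLM, which in practice would receive the full trajectory and return a risk score.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolCall:
    tool: str      # hypothetical tool name, e.g. "http_post"
    summary: str   # short description of the arguments and outcome


# A scorer maps the whole trajectory so far to a risk score in [0, 1].
# Assumption: in a real system this wraps a call to the monitor LLM.
Scorer = Callable[[List[ToolCall]], float]


@dataclass
class TrajectoryMonitor:
    scorer: Scorer
    threshold: float = 0.7  # assumed cutoff; tune against your own policy
    history: List[ToolCall] = field(default_factory=list)

    def allow(self, call: ToolCall) -> bool:
        """Score history plus the proposed call; record the call only if allowed."""
        candidate = self.history + [call]
        if self.scorer(candidate) >= self.threshold:
            return False
        self.history.append(call)
        return True


def stub_scorer(trajectory: List[ToolCall]) -> float:
    # Placeholder for the monitor LLM: treat network egress that follows
    # code execution as high risk, everything else as low risk.
    tools = [c.tool for c in trajectory]
    if "run_code" in tools and tools[-1] == "http_post":
        return 0.9
    return 0.1


monitor = TrajectoryMonitor(scorer=stub_scorer)
decisions = [
    monitor.allow(ToolCall("write_file", "wrote a download helper")),
    monitor.allow(ToolCall("run_code", "executed the helper")),
    monitor.allow(ToolCall("http_post", "sending output to external URL")),
]
print(decisions)
```

The point of the design is that the decision for each call depends on the accumulated history, so the same `http_post` call can be allowed in one session and blocked in another.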
Journey Context:
Developers typically deploy input/output filters on each individual LLM call. An attacker asks the agent to 'write a function to download a file', then to 'execute it', then to 'send the output to this URL'. Each step is benign in isolation, but the sequence as a whole is malicious. Per-turn filters miss the malicious intent because no single step violates the policy.
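The gap described above can be illustrated with a toy comparison, assuming each step has already been mapped to a coarse capability tag. The tags, the per-turn blocklist, and the `RISKY_SEQUENCE` pattern are invented for illustration; a real deployment would derive them from its own policy or from a monitor model rather than from fixed strings.

```python
# The three requests from the scenario, each paired with a hypothetical capability tag.
STEPS = [
    ("write a function to download a file", "download"),
    ("execute it", "execute"),
    ("send the output to this URL", "exfiltrate"),
]

# Illustrative per-turn blocklist: none of the tags above is forbidden on its own.
BLOCKED_PER_TURN = {"delete_all", "disable_logging"}

# Illustrative cross-turn pattern that the policy treats as malicious.
RISKY_SEQUENCE = ("download", "execute", "exfiltrate")


def per_turn_ok(tag: str) -> bool:
    """Stateless check: looks only at the current step."""
    return tag not in BLOCKED_PER_TURN


def contains_subsequence(tags, pattern) -> bool:
    """Stateful check: True if pattern occurs in order, not necessarily contiguously."""
    remaining = iter(tags)
    return all(p in remaining for p in pattern)


tags = [tag for _, tag in STEPS]
per_turn_results = [per_turn_ok(t) for t in tags]
trajectory_flagged = contains_subsequence(tags, RISKY_SEQUENCE)

print(per_turn_results)     # every step passes the per-turn filter
print(trajectory_flagged)   # the ordered sequence is flagged
```

Matching on an ordered subsequence rather than on adjacent steps means an attacker cannot evade the check simply by inserting harmless requests between the risky ones.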
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:00:51.434667+00:00— report_created — created