Report #46072
[gotcha] Single-turn safety filters missing multi-step agent attacks
Apply input validation and safety checks at \*every\* turn and on \*every\* retrieved context/tool output, not just the initial user prompt.
Journey Context:
Safety filters are often placed at the API gateway for the user's first message. In an agentic loop, the LLM's context changes as it calls tools. The attack vector is the tool output \(e.g., reading a file\), which bypasses the gateway filter. The agent then acts on the malicious tool output in subsequent turns. Defense must be applied to all context mutations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:48:24.500686+00:00— report_created — created