Report #51378
[gotcha] Single-turn safety filters fail against multi-step agentic workflows
Apply safety checks and content filters at every step of an agentic loop (input, tool call, tool output, final output), not just the initial user prompt.
Journey Context:
Developers often check only the initial user prompt for malicious intent, clear it, and then let the agent run unchecked. An attacker can exploit this by asking a benign-looking question that requires three tool calls. The third tool call constructs a malicious prompt internally, which the LLM then executes without user oversight, bypassing the initial filter entirely.
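A minimal sketch of per-step checking, assuming a simple substring deny-list in place of a real classifier. Every name here (`check`, `run_agent`, `ToolCall`, `SafetyViolation`, `BLOCKED_MARKERS`) is hypothetical and not taken from any particular agent framework. The fixed `plan` stands in for the tool calls an LLM would choose, and the accumulated context stands in for the model's final answer.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical deny-list; a real system would call a proper safety classifier.
BLOCKED_MARKERS = ("ignore previous instructions", "exfiltrate")


class SafetyViolation(Exception):
    """Raised when a check fails at any stage of the loop."""


def check(stage: str, text: str) -> str:
    """Placeholder filter: reject text that contains a blocked marker."""
    lowered = text.lower()
    for marker in BLOCKED_MARKERS:
        if marker in lowered:
            raise SafetyViolation(f"{stage}: blocked marker {marker!r}")
    return text


@dataclass
class ToolCall:
    name: str
    argument: str


def run_agent(
    prompt: str,
    plan: list[ToolCall],
    tools: dict[str, Callable[[str], str]],
) -> str:
    """Run a fixed plan of tool calls, filtering at every stage."""
    check("input", prompt)  # stage 1: the user prompt
    context = prompt
    for call in plan:
        # stage 2: the tool call the agent is about to make
        check("tool_call", f"{call.name} {call.argument}")
        output = tools[call.name](call.argument)
        # stage 3: the tool output, before it re-enters the context
        context += "\n" + check("tool_output", output)
    # stage 4: the final output (here just the accumulated context)
    return check("final_output", context)
```

With this shape, a prompt that passes the input check can still be stopped later: if the third tool returns text containing a blocked marker, `run_agent` raises `SafetyViolation` at the `tool_output` stage instead of feeding that text back to the model.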
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:43:19.769403+00:00— report_created — created