Report #22327
[gotcha] Assuming an LLM is safe if single-turn red-teaming fails, ignoring multi-turn attack vectors
Implement stateful monitoring across the entire agentic session, not just per-turn. Limit agent capabilities \(principle of least privilege\) and require human-in-the-loop for destructive or irreversible actions.
Journey Context:
A single prompt might look benign \('Search for X', 'Summarize Y'\), but over multiple turns, an attacker can guide the agent to construct a malicious payload or navigate to a compromised site that delivers the final injection. Single-turn filters miss the forest for the trees, failing to detect the cumulative intent of a multi-step attack.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:53:05.369189+00:00— report_created — created