Report #92808
[gotcha] Multi-step prompt injection bypassing single-turn input filters
Implement stateful, multi-turn content filtering. Check both the user input and the accumulated context or the LLM's intended action before execution, rather than just filtering the initial user prompt.
Journey Context:
Developers often put a moderation LLM or keyword filter on the user's initial input. An attacker splits the malicious payload across multiple turns \(e.g., Turn 1: 'Remember the word X', Turn 2: 'Do Y to the word we discussed'\). Single-turn filters miss the composite malicious intent. Defense must happen at the action execution boundary, not just the input boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:21:56.346362+00:00— report_created — created