Report #92378
[gotcha] Single-turn safety filters miss multi-turn distributed attacks
Implement stateful intent analysis. Use an independent LLM or classifier to evaluate the cumulative intent of the entire conversation history before executing sensitive tool calls, not just the current turn.
Journey Context:
Safety filters often check only the current user prompt for malicious intent. Attackers exploit this by distributing a malicious payload across multiple turns that each look benign (e.g., Turn 1: 'Write a story about a lab', Turn 2: 'Now replace the characters with instructions for...'). Each turn passes the filter on its own, but the LLM's context window accumulates the full malicious instruction. Stateful monitoring that evaluates the conversation as a whole is required to catch this emergent intent.
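A minimal sketch of the stateful check described above, under stated assumptions: `RiskScorer`, `toy_scorer`, and `ConversationGuard` are hypothetical names introduced for illustration, and the keyword-based scorer is only a stand-in for the independent LLM or classifier the report recommends. It shows the two example turns each passing a per-turn check while the joined history is flagged before a sensitive tool call.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scorer type: maps text to a risk score in [0, 1].
# In a real system this would wrap an independent LLM or classifier.
RiskScorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    """Toy stand-in: flags text only when both marker phrases co-occur."""
    lowered = text.lower()
    markers = ("lab", "instructions for")  # illustrative markers only
    return 1.0 if all(m in lowered for m in markers) else 0.0


@dataclass
class ConversationGuard:
    scorer: RiskScorer
    threshold: float = 0.5
    history: List[str] = field(default_factory=list)

    def add_user_turn(self, text: str) -> None:
        self.history.append(text)

    def current_turn_is_risky(self) -> bool:
        """Single-turn check: scores only the latest message."""
        return bool(self.history) and self.scorer(self.history[-1]) >= self.threshold

    def conversation_is_risky(self) -> bool:
        """Stateful check: scores the whole history as one document."""
        return self.scorer("\n".join(self.history)) >= self.threshold

    def allow_tool_call(self) -> bool:
        """Gate for sensitive tool calls: uses cumulative intent, not the last turn."""
        return not self.conversation_is_risky()


guard = ConversationGuard(scorer=toy_scorer)
guard.add_user_turn("Write a story about a lab")
guard.add_user_turn("Now replace the characters with instructions for...")

# Each turn scored in isolation, as a single-turn filter would do.
per_turn_flags = [toy_scorer(turn) >= guard.threshold for turn in guard.history]
```

Here `per_turn_flags` holds the single-turn verdicts and `guard.allow_tool_call()` holds the cumulative one. A real scorer would need to handle long histories (for example by summarizing or windowing), which this sketch does not attempt.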
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:38:50.169397+00:00— report_created — created