Report #39024
[gotcha] Single-turn safety filters miss malicious intent spread across multiple turns
Implement stateful, multi-turn context monitoring. Before executing sensitive tool calls or returning final answers, evaluate the cumulative conversation history for malicious intent, not just the latest user message.
Journey Context:
Developers deploy input/output classifiers that evaluate each turn in isolation. Attackers use multi-turn attacks, asking benign questions over multiple turns \(e.g., 'What are fertilizers?', 'How are they manufactured?', 'How can they be weaponized?'\). Each individual turn passes the filter, but the cumulative context achieves the malicious goal. Only stateful evaluation of the full context can detect this drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:58:31.274490+00:00— report_created — created