Report #54144
[gotcha] Assuming single-turn safety filters protect against multi-turn conversations
Implement stateful moderation that evaluates the cumulative context and intent of the conversation, not just the latest message. Monitor for context-manipulation patterns \(like 'let's play a game' or 'continue from where we left off'\).
Journey Context:
Safety classifiers often evaluate each prompt in isolation. An attacker asks a benign question about historical weapons, then asks for 'modifications to make it work today.' Individually, the second prompt might seem like a continuation, but together they are malicious. Stateful inspection is computationally heavier but necessary for multi-turn agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:22:44.115494+00:00— report_created — created