Report #21621
[gotcha] Single-turn safety filters bypassed by spreading the attack across multiple turns
Implement stateful safety monitoring that evaluates the cumulative intent of the conversation, not just individual turns, and revoke capabilities if the conversation drifts towards policy violation.
Journey Context:
Safety filters are often stateless: they evaluate each prompt in isolation. An attacker asks a benign question in turn 1, then incrementally asks the LLM to modify or build on the answer in later turns (e.g., 'Write a story about a chemist', then 'What specific real-world chemicals would the chemist use?'). Each turn passes the filter on its own, yet the conversation as a whole drifts toward a policy violation. The LLM's context window holds the accumulated state, but the filter doesn't, so the filter must track cumulative intent across turns.
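A minimal sketch of cumulative intent tracking, under stated assumptions. The names `ConversationMonitor`, `toy_turn_risk`, `RISKY_TERMS`, the thresholds, and the decay factor are all hypothetical and chosen for illustration. The keyword scorer stands in for whatever per-turn classifier is actually in use. It is not a real safety filter.

```python
from typing import Callable, List, Set

# Hypothetical keyword weights. A real deployment would call a trained
# classifier or moderation model instead of matching substrings.
RISKY_TERMS = {"chemist": 0.2, "chemicals": 0.2, "real-world": 0.2}


def toy_turn_risk(text: str) -> float:
    """Toy per-turn risk score in [0, 1] (assumption, for illustration only)."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in lowered))


class ConversationMonitor:
    """Tracks risk across the whole conversation, not just the latest turn."""

    def __init__(
        self,
        score_turn: Callable[[str], float] = toy_turn_risk,
        per_turn_limit: float = 0.7,     # what a stateless filter would check
        cumulative_limit: float = 0.75,  # extra check on accumulated risk
        decay: float = 0.9,              # older turns count slightly less
    ) -> None:
        self.score_turn = score_turn
        self.per_turn_limit = per_turn_limit
        self.cumulative_limit = cumulative_limit
        self.decay = decay
        self.cumulative = 0.0
        self.history: List[str] = []
        self.capabilities: Set[str] = {"generate", "tools"}

    def observe(self, user_turn: str) -> bool:
        """Record a turn; return True if the conversation may continue."""
        self.history.append(user_turn)
        turn_risk = self.score_turn(user_turn)
        # Decayed running sum: earlier turns still contribute to the total.
        self.cumulative = self.cumulative * self.decay + turn_risk
        if (turn_risk >= self.per_turn_limit
                or self.cumulative >= self.cumulative_limit):
            # Revoke every capability; revocation persists for later turns.
            self.capabilities.clear()
        return bool(self.capabilities)
```

With these illustrative numbers, the two example turns above each score below `per_turn_limit`, so a stateless check would pass both, but the decayed running total crosses `cumulative_limit` on the second turn and `observe` returns `False`. A decayed sum is only one possible design. Re-scoring the full transcript with a classifier on every turn is another, at higher cost.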
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:41:56.807008+00:00— report_created — created