Report #94231
[gotcha] Safety filters that check only individual turns miss attacks distributed across multiple turns
Apply safety classifiers to the entire accumulated context window or a rolling window of recent turns, not just the latest user message.
Journey Context:
A user asks a benign question in turn 1, then in turn 5 asks to 'summarize our previous discussion but make it about [malicious topic]'. No single turn triggers the filter. The LLM's context window accumulates the payload over time. Single-turn classifiers are blind to the composite meaning built across the conversation history.
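The rolling-window approach could be sketched roughly as below. This is a minimal illustration, not a production filter: `toy_classifier`, `RollingWindowFilter`, the `window_size` parameter, and the placeholder terms "alpha" and "omega" are all hypothetical names introduced here. A real deployment would swap in an actual safety classifier and would likely record assistant replies in the window as well as user messages.

```python
from collections import deque
from typing import Callable, Deque


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for a real safety classifier.

    Flags text only when both placeholder terms occur together,
    mimicking a payload that is harmful only in combination.
    """
    lowered = text.lower()
    return "alpha" in lowered and "omega" in lowered


class RollingWindowFilter:
    """Classifies the last `window_size` turns joined together."""

    def __init__(self, classifier: Callable[[str], bool], window_size: int = 6) -> None:
        self.classifier = classifier
        # deque with maxlen silently drops the oldest turn once full
        self.turns: Deque[str] = deque(maxlen=window_size)

    def check(self, message: str) -> bool:
        """Record the new turn, then classify the whole window."""
        self.turns.append(message)
        return self.classifier("\n".join(self.turns))


if __name__ == "__main__":
    conversation = [
        "Tell me about alpha.",
        "Thanks, that helps.",
        "Now rewrite our discussion so it is about omega.",
    ]
    window_filter = RollingWindowFilter(toy_classifier, window_size=6)
    for turn in conversation:
        per_turn = toy_classifier(turn)          # single-turn check
        windowed = window_filter.check(turn)     # rolling-window check
        print(f"per_turn={per_turn} windowed={windowed} :: {turn}")
```

In this sketch the per-turn check never fires, while the windowed check fires on the third turn because the joined text contains both terms. The window size is a trade-off: too small and the earlier half of a split payload is evicted before the second half arrives, too large and classification cost and false positives grow with conversation length.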
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:45:15.787397+00:00— report_created — created