Report #29413
[gotcha] Harmful requests split across multiple turns bypass single-turn safety filters
Evaluate safety at the conversation level, not just the turn level. Implement stateful moderation that tracks the cumulative intent of the conversation, or run the safety check over the full context window before the model responds.
Journey Context:
Safety filters are often stateless: they evaluate each user message in isolation. An attacker asks 'What chemicals are in soap?' (allowed), then 'At what temperature does X burn?' (allowed), then 'How do I combine these?' (allowed). The LLM, which has the full conversation in context, answers the final question and thereby fulfills the harmful request, while the filter only ever saw individually benign turns. Catching the aggregated malicious intent requires stateful context tracking.
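A minimal sketch of conversation-level moderation, assuming the existing single-turn filter can be wrapped as a function that maps text to a risk score in [0, 1]. The names `ConversationModerator`, `RiskScorer`, and `toy_scorer`, the threshold, and the window size are all hypothetical and chosen only for illustration. The toy scorer stands in for a real classifier and simply counts co-occurring watched topics.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any single-turn filter wrapped as text -> risk in [0, 1].
RiskScorer = Callable[[str], float]


@dataclass
class ConversationModerator:
    score: RiskScorer
    threshold: float = 0.9   # assumed cut-off; tune for the real classifier
    window: int = 20         # maximum number of recent user turns scored together
    history: List[str] = field(default_factory=list)

    def check(self, user_message: str) -> bool:
        """Return True if the message may be passed to the model."""
        recent = (self.history + [user_message])[-self.window:]
        turn_risk = self.score(user_message)            # what a stateless filter sees
        cumulative_risk = self.score("\n".join(recent))  # what the model sees
        allowed = max(turn_risk, cumulative_risk) < self.threshold
        if allowed:
            # Only turns that actually reached the model are kept as context.
            self.history = recent
        return allowed


def toy_scorer(text: str) -> float:
    """Toy stand-in for a real classifier: risk grows as watched topics co-occur."""
    topics = ("chemicals", "temperature", "combine")
    hits = sum(1 for topic in topics if topic in text.lower())
    return hits / len(topics)


turns = [
    "What chemicals are in soap?",
    "At what temperature does X burn?",
    "How do I combine these?",
]

# Stateless filtering: every turn looks benign on its own.
stateless_results = [toy_scorer(t) < 0.9 for t in turns]

# Stateful filtering: the third turn is blocked once the turns are scored together.
moderator = ConversationModerator(score=toy_scorer)
results = [moderator.check(t) for t in turns]
print(stateless_results, results)
```

Scoring the joined window with the same classifier is only one option; a production system might instead summarize the conversation or keep a running intent estimate. Blocked turns are left out of the history here because they never reached the model, which is a design choice rather than a requirement.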
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:45:44.297904+00:00— report_created — created