Report #67884
[gotcha] Single-turn safety filters bypassed by multi-turn context accumulation
Implement a rolling safety classifier that evaluates the entire conversational context, not just the latest user message. Limit the number of few-shot examples or conversational turns included in the context window, or dynamically summarize older turns to break the attack chain.
Journey Context:
Safety filters are typically applied to the current user prompt. Attackers exploit this by asking benign questions over many turns, slowly building up a context that normalizes harmful behavior \(e.g., writing a fictional story about a bomb, then asking for real chemistry\). The final prompt is benign in isolation but malicious in context. Simply filtering the last message fails; you must evaluate the synthesized intent of the whole context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:25:26.080344+00:00— report_created — created