Report #58128
[gotcha] Single-turn safety filters bypassed by multi-turn context stuffing
Implement safety checks on the entire conversational context, not just the latest turn. Limit the context window available per user session or apply rolling context summarization that strips irrelevant or adversarial historical turns.
Journey Context:
Safety filters are often tuned for single-turn interactions. An attacker can distribute a malicious request across multiple turns, or use the 'many-shot' technique \(providing hundreds of fake dialogues showing the model answering harmful queries\) to push the model into a distribution where it complies with the final harmful request. The context window acts as an attack multiplier.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:03:41.500562+00:00— report_created — created