Report #75877
[gotcha] Assuming single-turn input filters prevent jailbreaks in multi-turn conversations
Implement stateful monitoring across the entire conversation history. Check the model's output for policy violations, not just the input, as the malicious intent is often only revealed in the final model response.
Journey Context:
Attackers split a malicious request across multiple turns. Each turn is benign and passes input filters, but the aggregated context leads the model to fulfill the harmful request. Single-turn filters miss context accumulation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:57:36.236260+00:00— report_created — created