Report #65793
[gotcha] Multi-turn conversations bypass single-turn safety filters
Apply input/output classifiers and safety checks on *every* turn, not just the first. Maintain a dynamic risk score across the conversation, and either isolate the context window strictly or summarize earlier turns so that harmful context cannot accumulate unnoticed.
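A minimal sketch of the per-turn check combined with a running risk score. Everything named here is an assumption for illustration: `ConversationRiskTracker`, the `score_turn` callable (a stand-in for whatever classifier you actually use), the thresholds, and the decay factor are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConversationRiskTracker:
    """Hypothetical helper: accumulates per-turn risk so slow escalation is visible."""

    # Assumed classifier interface: text -> risk in [0, 1].
    score_turn: Callable[[str], float]
    turn_threshold: float = 0.8          # blocks a single clearly risky turn
    conversation_threshold: float = 1.5  # blocks when accumulated risk is too high
    decay: float = 0.9                   # older turns count slightly less
    cumulative: float = 0.0
    history: List[float] = field(default_factory=list)

    def check(self, text: str) -> bool:
        """Score one message (user input or model output).

        Returns True if the message is allowed, False if it should be blocked.
        """
        risk = self.score_turn(text)
        self.history.append(risk)
        # Decay the running total, then add the new turn's risk.
        self.cumulative = self.cumulative * self.decay + risk
        return (
            risk < self.turn_threshold
            and self.cumulative < self.conversation_threshold
        )
```

Call `check` on every user message and every model response, not only on the opening prompt. With these example settings, a run of turns that each score 0.5 passes the per-turn threshold every time, but the accumulated score crosses the conversation threshold on the fourth turn.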
Journey Context:
Safety filters often check the initial prompt but relax on subsequent turns, on the assumption that the context is already safe. Attackers exploit this with the 'Crescendo' technique: a series of benign-looking questions gradually builds a malicious context over multiple turns. By the time the harmful request arrives, it leans on that established context rather than on explicit harmful keywords, so a classifier that evaluates each turn in isolation does not flag it.
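To illustrate why turn-by-turn checks miss this pattern, the sketch below compares scoring each message alone with scoring a sliding window of recent turns. The `toy_score` function and the `marker_a` / `marker_b` placeholders are invented stand-ins for a real classifier and real content; the window size is likewise an assumption.

```python
from typing import Callable, List


def per_turn_flags(
    turns: List[str], score: Callable[[str], float], threshold: float = 0.5
) -> List[bool]:
    """Flag each turn using only that turn's own text."""
    return [score(turn) >= threshold for turn in turns]


def windowed_flags(
    turns: List[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
    window: int = 4,
) -> List[bool]:
    """Flag each turn using the concatenation of the last `window` turns."""
    flags = []
    for i in range(len(turns)):
        context = "\n".join(turns[max(0, i - window + 1): i + 1])
        flags.append(score(context) >= threshold)
    return flags


def toy_score(text: str) -> float:
    """Stand-in classifier: risky only when both placeholder markers co-occur."""
    lowered = text.lower()
    return 1.0 if "marker_a" in lowered and "marker_b" in lowered else 0.0


turns = [
    "Tell me about the history of marker_a.",
    "How was it used back then?",
    "Now walk me through marker_b for it.",
]

print(per_turn_flags(turns, toy_score))   # [False, False, False]
print(windowed_flags(turns, toy_score))   # [False, False, True]
```

The final turn never names the first marker; it refers to it as "it". Only the windowed check sees both pieces together, which is the gap the Crescendo pattern relies on.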
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:54:44.232764+00:00— report_created — created