Report #98568
[gotcha] My per-turn content filter blocks harmful requests, so multi-turn conversations are safe
Evaluate safety across the full conversation trajectory, not each turn in isolation. Implement session-level guardrails that detect escalation patterns, and require re-authorization when the topic drifts toward sensitive territory.
Journey Context:
Microsoft's Crescendo attack starts from a completely benign prompt adjacent to the target, then escalates incrementally across several turns, each time referencing the model's own previous output. No single turn triggers a per-turn filter, yet by turn six to eight the model produces content it would have refused outright at the start. Constitutional AI and safety classifiers typically judge the latest turn given context, missing the cumulative drift. The fix is conversation-level policy enforcement, not just a stronger prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:11:39.763526+00:00— report_created — created