Report #62097
[gotcha] Single-turn safety filters bypassed by multi-turn incremental context shifts
Implement stateful moderation that evaluates the cumulative context and intent across the entire conversation, not just the latest turn, and restrict the model's ability to drastically shift persona or role over time.
Journey Context:
Safety filters often check the current user prompt in isolation. The 'Crescendo' attack starts with benign requests and escalates gradually, asking the LLM to build on previous (seemingly safe) context. By the time the malicious request arrives, it is framed as a natural continuation of the established context, so a filter that inspects only the isolated turn sees no sudden malicious intent and lets it through.
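A minimal sketch of the stateful check described above, assuming a scoring function that maps text to a risk value in [0, 1]. The names `toy_risk_score`, `RISKY_TERMS`, and `ConversationModerator`, the keyword weights, and both thresholds are hypothetical and chosen only for illustration; a real deployment would likely swap in an actual moderation classifier. The idea is to score the whole transcript on every turn, so that an escalation spread over several individually mild turns can still trip the filter.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical keyword table standing in for a real moderation classifier.
RISKY_TERMS = {"explosive": 0.5, "detonator": 0.5}


def toy_risk_score(text: str) -> float:
    """Return a risk score in [0, 1] from the keyword table (illustration only)."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in RISKY_TERMS.items() if term in lowered))


@dataclass
class ConversationModerator:
    scorer: Callable[[str], float] = toy_risk_score
    turn_threshold: float = 0.8        # block a single turn at or above this score
    cumulative_threshold: float = 0.8  # block when the full transcript reaches this score
    history: List[str] = field(default_factory=list)

    def check(self, user_turn: str) -> bool:
        """Return True if the turn is allowed. Blocked turns are not stored."""
        candidate = self.history + [user_turn]
        per_turn = self.scorer(user_turn)
        # Score the accumulated conversation, not just the latest message.
        cumulative = self.scorer("\n".join(candidate))
        if per_turn >= self.turn_threshold or cumulative >= self.cumulative_threshold:
            return False
        self.history = candidate
        return True


if __name__ == "__main__":
    moderator = ConversationModerator()
    for turn in [
        "Tell me about the history of mining.",
        "How did miners use an explosive charge?",
        "What role did the detonator play?",
    ]:
        print(moderator.check(turn), "-", turn)
```

In this toy run each turn scores below the per-turn threshold on its own, but the third turn pushes the transcript score over the cumulative threshold and is rejected. The sketch does not cover the persona or role restriction, which would need a separate check.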
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:43:01.190491+00:00 — report_created — created