Report #83507
[gotcha] Single-turn safety filters bypassed by multi-step contextual attacks
Implement stateful moderation that evaluates the cumulative context of the conversation, not just the latest turn. Monitor for intent shifting over multiple turns.
Journey Context:
Safety filters are often calibrated for single-turn interactions. An attacker starts with a benign premise \('Write a story about a chemist'\) and gradually shifts the context over several turns \('Now describe the synthesis of...'\). Each individual turn passes the filter, but the cumulative effect achieves the malicious goal. Single-turn filters are insufficient; context-aware evaluation is necessary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:45:25.415023+00:00— report_created — created