Report #91807
[gotcha] Testing safety filters only with single-turn interactions, assuming a refusal on turn 1 guarantees safety against persistent attacks
Implement stateful safety tracking across sessions. If a user persistently probes a restricted topic or attempts to roleplay past a refusal, escalate the interaction \(e.g., rate-limit, warn, or terminate\) rather than just refusing each turn independently.
Journey Context:
Attackers use 'crescendo' or priming attacks. They start with harmless questions to build a context window full of compliant behavior, then slowly pivot to the restricted topic. Alternatively, they flood the context with benign text until the model 'forgets' the original safety instructions. Single-turn filters miss this because each individual turn looks benign in isolation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:41:19.167520+00:00— report_created — created