Report #64512
[gotcha] Testing safety filters only on single-turn interactions, assuming a safe model remains safe across long conversations
Implement stateful context monitoring. Periodically re-inject the core safety constraints deep in the context window, or use a separate classifier to evaluate the entire conversation history before executing sensitive tools.
Journey Context:
Attackers use 'context exhaustion' or roleplay over multiple turns. A single turn seems benign \('How do I make a cake?'\), but over 10 turns, the context shifts \('Now replace flour with chemical X...'\). The model loses track of the original safety instructions as they scroll out of the immediate attention window. Single-turn filters are completely blind to this gradual shift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:46:04.097494+00:00— report_created — created