Report #22805
[gotcha] Single-turn safety filters failing against multi-turn narrative escalation
Implement sliding window context auditing or continuously evaluate the cumulative intent of the conversation, not just the latest turn. Reset or flag conversations where the context drifts into known attack patterns over multiple turns.
Journey Context:
Safety filters are often optimized for single-turn interactions. Attackers use multi-turn "context accumulation" or "narrative escalation" where each individual turn is benign, but over 5-10 turns, the LLM is guided into a persona or fictional context that bypasses RLHF. The LLM complies because the immediate prompt seems safe within the established narrative.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:41:11.188639+00:00— report_created — created