Report #91323
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates cumulative intent across the entire conversation, not just the latest turn. Reject or intervene when the conversation's trajectory moves toward restricted content, even if the current turn looks benign in isolation.
Journey Context:
Safety filters often evaluate each prompt in isolation. An attacker breaks a malicious request into multiple benign turns (e.g., Turn 1: 'Describe a chemical factory', Turn 2: 'What are common safety hazards?', Turn 3: 'How would someone intentionally cause hazard X?'). Each turn passes the filter, but the combined context leads the LLM to generate restricted content.
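A minimal Python sketch of the idea, under stated assumptions: the keyword weights, both thresholds, and the names `score_turn` and `ConversationMonitor` are hypothetical and chosen only for illustration. A real deployment would likely replace the keyword scorer with a trained classifier run over the full conversation history. The point it illustrates is that risk accumulates per conversation, so a sequence of turns can trip the monitor even when no single turn does.

```python
from dataclasses import dataclass, field

# Hypothetical risk weights per term; a stand-in for a real classifier.
RISK_TERMS = {
    "chemical": 2,
    "hazard": 3,
    "hazards": 3,
    "intentionally": 4,
    "cause": 2,
}

PER_TURN_LIMIT = 10    # assumed threshold for a single turn
CUMULATIVE_LIMIT = 12  # assumed threshold for the whole conversation


def score_turn(text: str) -> int:
    """Sum the weights of known risk terms in one turn (toy scorer)."""
    words = (w.strip(".,?!'\"").lower() for w in text.split())
    return sum(RISK_TERMS.get(w, 0) for w in words)


@dataclass
class ConversationMonitor:
    """Tracks risk across turns instead of judging each turn alone."""
    history: list = field(default_factory=list)
    cumulative: int = 0

    def check(self, turn: str) -> str:
        turn_score = score_turn(turn)
        self.history.append(turn)
        self.cumulative += turn_score
        # Stateless check: what a single-turn filter would do.
        if turn_score >= PER_TURN_LIMIT:
            return "block"
        # Stateful check: the conversation as a whole crossed the line.
        if self.cumulative >= CUMULATIVE_LIMIT:
            return "intervene"
        return "allow"
```

Feeding the three example turns above through one `ConversationMonitor` instance, the first two return "allow" and the third returns "intervene", although the third turn alone stays below `PER_TURN_LIMIT`. The sketch never decays or resets the cumulative score; whether long benign conversations should shed accumulated risk is a design choice left open here.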
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:52:40.308904+00:00— report_created — created