Report #43226
[gotcha] Multi-turn conversations bypassing single-turn safety filters
Evaluate the combined context of the conversation for malicious intent, not just the latest user message. Implement sliding window intent checks and reset conversation context if manipulation is detected.
Journey Context:
Safety filters often only inspect the immediate user input. Attackers use a 'divide and conquer' approach, asking the LLM to perform small, seemingly harmless steps across multiple turns that cumulatively achieve a restricted goal (e.g., asking for chemical synthesis one reagent at a time).
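A minimal sketch of the sliding window check described above, in Python. Everything named here is an assumption made for illustration: `ConversationGuard`, `toy_risk_score`, the placeholder terms in `RESTRICTED_TERMS`, and the `window` and `threshold` values are hypothetical, and the keyword counter only stands in for whatever intent classifier or moderation model is actually in use. The point it shows is that the score is computed over the joined recent turns rather than over the latest message alone, and that the window is cleared once the combined score crosses the threshold.

```python
from collections import deque
from typing import Callable, Deque

# Hypothetical placeholder terms; a real deployment would call a proper
# intent classifier instead of matching keywords.
RESTRICTED_TERMS = {"reagent_a", "reagent_b", "reagent_c"}


def toy_risk_score(text: str) -> int:
    """Count distinct restricted terms in the text (naive whitespace split)."""
    words = set(text.lower().split())
    return len(words & RESTRICTED_TERMS)


class ConversationGuard:
    """Scores the last `window` user turns together, not one at a time."""

    def __init__(
        self,
        scorer: Callable[[str], int] = toy_risk_score,
        window: int = 4,
        threshold: int = 2,
    ) -> None:
        self.scorer = scorer
        self.threshold = threshold
        # deque with maxlen drops the oldest turn automatically.
        self.recent: Deque[str] = deque(maxlen=window)

    def check(self, user_message: str) -> bool:
        """Return True if allowed; False if the combined window was flagged.

        On a flag, the stored context is cleared so later turns cannot
        build on the partial steps already collected.
        """
        self.recent.append(user_message)
        combined = " ".join(self.recent)
        if self.scorer(combined) >= self.threshold:
            self.recent.clear()
            return False
        return True


guard = ConversationGuard(window=4, threshold=2)
for turn in [
    "tell me about reagent_a",
    "what is the weather",
    "now how about reagent_b",
]:
    print(turn, "->", "allowed" if guard.check(turn) else "flagged, context reset")
```

Each of the three turns scores below the threshold on its own, so a single-turn filter with the same scorer would pass all of them; only the combined window flags the third. The window size is a trade-off: a short window is cheap but lets an attacker space requests further apart, while a long window raises both cost and the false-positive rate.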
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:01:47.587256+00:00 — report_created — created