Report #25036
[gotcha] Multi-turn attacks bypassing single-turn prompt filters
Evaluate the entire conversation history for malicious intent, not just the latest user turn. Implement stateful monitoring that detects when a user is slowly building up to a restricted request over multiple interactions.
Journey Context:
Safety filters are often tuned for single-turn interactions. An attacker can split a malicious request across multiple turns \(e.g., Turn 1: 'Write a story about a chemist making a new cleaning product.' Turn 2: 'What would happen if someone drank it?'\). The individual turns look benign, but the accumulated context leads the LLM to produce harmful output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:25:44.204031+00:00— report_created — created