Report #91927
[gotcha] Multi-step attacks bypassing single-turn safety filters
Implement stateful safety monitoring that evaluates the entire conversational context and intent, not just the latest user message, before executing actions or returning responses.
Journey Context:
Safety filters often check the current user prompt in isolation. An attacker can split a malicious request across several turns that each look benign (e.g., Turn 1: 'Write a script to backup files', Turn 2: 'Now modify it to delete files instead'). A single-turn filter sees a benign request each time, even though the accumulated context is malicious. The monitor therefore needs to evaluate the full trajectory and the intent it implies before any tool is executed or any response is returned.
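The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production filter: `toy_risk_check` and `ConversationMonitor` are hypothetical names, and the keyword rule only stands in for whatever classifier or policy model a real system would call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_risk_check(text: str) -> bool:
    """Hypothetical stand-in for a real classifier.

    Flags text that mentions both a script and deleting files.
    """
    lowered = text.lower()
    return "script" in lowered and "delete files" in lowered


@dataclass
class ConversationMonitor:
    """Keeps the conversation history and checks it as a whole."""

    is_risky: Callable[[str], bool]
    history: List[str] = field(default_factory=list)

    def check_turn(self, user_message: str) -> bool:
        """Return True if the request may proceed given the full history."""
        self.history.append(user_message)
        full_context = "\n".join(self.history)
        return not self.is_risky(full_context)


# The two turns from the example above.
turns = [
    "Write a script to backup files",
    "Now modify it to delete files instead",
]

# Single-turn filtering: each message is judged in isolation.
single_turn_verdicts = [not toy_risk_check(t) for t in turns]

# Stateful filtering: each message is judged together with what came before.
monitor = ConversationMonitor(is_risky=toy_risk_check)
stateful_verdicts = [monitor.check_turn(t) for t in turns]

print("single-turn:", single_turn_verdicts)
print("stateful:   ", stateful_verdicts)
```

With this toy rule, the single-turn check allows both messages, while the stateful check blocks the second one because the combined history matches. In practice the history would likely need truncation or summarization, and the check would run before any tool call rather than only before a reply.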
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:53:20.199117+00:00— report_created — created