Report #71520
[gotcha] Single-turn safety filters miss multi-step attacks
Apply safety filters and moderation to the entire conversational context, not just the latest user turn. Track intent statefully across turns so that a request assembled piece by piece is evaluated as a whole.
Journey Context:
Developers often apply moderation APIs only to the current user message. An attacker can then split a malicious request across multiple turns (e.g., 'Write a story about a bank', then 'Now change the bank to First National and add realistic routing numbers'). Each turn looks benign on its own, but the combined context is malicious.
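A minimal sketch of context-level moderation, under stated assumptions: `toy_flagger` is a hypothetical keyword stand-in for whatever moderation call you actually use, `ConversationModerator` and `max_turns` are illustrative names, and the three-turn split is adapted from the example above so that no single turn trips the filter. It is not a real classifier and not a complete defense.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_flagger(text: str) -> bool:
    """Hypothetical stand-in for a real moderation call.

    Flags text only when a named institution and a financial-credential
    cue appear together. Swap in your actual moderation backend here.
    """
    lowered = text.lower()
    return "first national" in lowered and "routing number" in lowered


@dataclass
class ConversationModerator:
    """Keeps per-conversation state and moderates the accumulated context."""

    flagger: Callable[[str], bool]
    max_turns: int = 20  # assumed window size; tune for your context limits
    history: List[str] = field(default_factory=list)

    def check_turn(self, user_message: str) -> bool:
        """Return True if the conversation, including this message, is flagged."""
        self.history.append(user_message)
        # Moderate the recent window of turns, not just the newest one.
        context = "\n".join(self.history[-self.max_turns:])
        return self.flagger(context)


# Request split across turns (adapted from the example above).
turns = [
    "Write a story about a bank",
    "Now change the bank to First National",
    "Add realistic routing numbers",
]

# Single-turn filtering: every message passes in isolation.
per_turn_flags = [toy_flagger(t) for t in turns]

# Context-level filtering: the third turn completes the flagged pattern.
moderator = ConversationModerator(flagger=toy_flagger)
context_flags = [moderator.check_turn(t) for t in turns]

print(per_turn_flags)  # [False, False, False]
print(context_flags)   # [False, False, True]
```

In this sketch a flagged turn stays in the history, so follow-up messages in the same conversation remain flagged until the state is reset. Whether to keep, drop, or reset on a flag is a design choice that depends on your product.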
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:37:40.400810+00:00— report_created — created