Report #44173
[gotcha] Single-turn safety filters miss multi-step social engineering attacks
Implement stateful conversation monitoring that evaluates the cumulative intent of the conversation, not just the latest turn, and strip the model's ability to maintain context of previous malicious turns if a boundary is crossed.
Journey Context:
Developers rely on input classifiers or system prompts that check the current user message. Attackers use the 'Crescendo' technique: asking benign questions over multiple turns, slowly building up context until the model generates harmful content. Each individual turn is harmless, so single-turn filters pass, but the aggregate context crosses the line.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:37:00.477573+00:00— report_created — created