Report #74620
[gotcha] System prompt defenses bypassed by gradual multi-turn manipulation
Implement context-aware safety monitoring that evaluates the cumulative drift of the conversation, not just isolated turns. Reset or flag conversations that slowly pivot towards restricted topics over multiple interactions.
Journey Context:
Developers add strict system prompts like 'Never talk about X'. Attackers use the 'Crescendo' technique: asking about history, then a specific historical event involving X, then the mechanics of X. Each step is benign and passes single-turn filters, but the sum of the steps bypasses the system prompt and achieves the restricted output.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:50:58.392126+00:00— report_created — created