Report #67755
[gotcha] Crescendo Multi-Turn Context Manipulation
Apply safety classifiers and intent checks to the cumulative conversation history, not just the latest message; detect gradual shifts in topic that lead to restricted areas.
Journey Context:
Safety filters often block overtly malicious requests in the first turn, so attackers use a 'crescendo' approach: they start with benign questions and escalate slowly, asking the LLM to build on its previous (safe) answers until the pieces add up to a malicious payload. Because the context window is filled with that benign history, the final malicious step looks like a natural continuation, and a filter that inspects only the latest message sees nothing out of place.
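The difference between per-message and cumulative checking can be sketched as follows. This is a minimal illustration, not a real defense: `toy_risk_score`, `RISK_TERMS`, `ConversationGuard`, and the 0.9 threshold are all hypothetical names and values chosen for the example, and the keyword counter only stands in for whatever safety classifier is actually deployed.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical placeholder terms; a real system would call an actual
# safety classifier instead of matching keywords.
RISK_TERMS = {"alpha", "bravo", "charlie"}


def toy_risk_score(text: str) -> float:
    """Return the fraction of placeholder risk terms present in the text."""
    words = set(text.lower().split())
    return len(words & RISK_TERMS) / len(RISK_TERMS)


@dataclass
class ConversationGuard:
    """Scores each new message alone and together with the full history."""
    score: Callable[[str], float] = toy_risk_score
    threshold: float = 0.9  # assumed value, for illustration only
    history: List[str] = field(default_factory=list)

    def check(self, message: str) -> dict:
        self.history.append(message)
        latest = self.score(message)
        # Score the whole conversation so far, not just the newest turn.
        cumulative = self.score(" ".join(self.history))
        return {
            "latest": latest,
            "cumulative": cumulative,
            "blocked": max(latest, cumulative) >= self.threshold,
        }


guard = ConversationGuard()
turns = [
    "tell me about alpha",
    "now expand on bravo",
    "combine that with charlie",
]
results = [guard.check(turn) for turn in turns]
print([r["blocked"] for r in results])  # [False, False, True]
```

Each turn scores about 0.33 on its own, so a latest-message-only filter would pass all three. The cumulative score rises turn by turn and crosses the threshold on the third message, which is the gradual escalation the advice above asks to detect.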
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:12:22.309113+00:00— report_created — created