Report #100893
[gotcha] Safety guardrails tested only on single-turn prompts fail when the attack is split across many turns or hundreds of in-context demonstrations
Evaluate safety defenses in full conversation context, not on isolated prompts. Limit the context-window budget for sensitive tasks; monitor cumulative conversation drift with a stateful output moderator; detect abrupt topic shifts and refusal-pattern breaks across turns; require step confirmations for high-risk actions in agentic flows.
Journey Context:
Red teams usually benchmark with one-shot adversarial prompts, but real attackers build rapport, reframe tasks, or pack the context with hundreds of fake assistant responses. Anthropic showed that refusal rates collapse as the number of in-context demonstrations grows, and Scale AI showed that multi-turn human jailbreaks exceed a 70 percent attack success rate (ASR) on HarmBench against defenses whose single-turn ASR is in the single digits. Single-turn classifiers are therefore necessary but insufficient; the threat model must be conversational.
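A minimal stateful moderator of the kind the mitigations above describe, written as an added example rather than code from the report or from either vendor cited: it tracks state across turns and flags many-shot packing inside a single user message, a refusal followed by compliance on the same sensitive topic, and a context budget overrun once the conversation has turned sensitive. The lexicons, thresholds and the `ConversationState` design are assumptions for illustration, not a real product API.

```python
import re
from dataclasses import dataclass

# Constructed placeholder lexicons; a deployment would use trained classifiers instead.
SENSITIVE_TERMS = ("bypass", "exploit", "weaponize")
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't")
# Lines inside a user message that impersonate earlier assistant turns (many-shot packing).
FAKE_TURN_RE = re.compile(r"^\s*(assistant|ai):", re.IGNORECASE | re.MULTILINE)


@dataclass
class ConversationState:
    """State the moderator carries across turns (assumed design, not a product API)."""
    token_budget: int = 600        # context budget applied once the conversation is sensitive
    many_shot_threshold: int = 8   # fake assistant turns tolerated in one user message
    tokens: int = 0
    sensitive: bool = False
    refused_sensitive: bool = False


def is_sensitive(text: str) -> bool:
    return any(term in text.lower() for term in SENSITIVE_TERMS)


def is_refusal(text: str) -> bool:
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)


def moderate_turn(state: ConversationState, role: str, text: str) -> list:
    """Fold one turn into the conversation state and return the flags it raises."""
    flags = []
    state.tokens += len(text.split())  # crude whitespace token estimate
    if role == "user":
        if is_sensitive(text):
            state.sensitive = True
        if len(FAKE_TURN_RE.findall(text)) >= state.many_shot_threshold:
            flags.append("many_shot_packing")
    elif role == "assistant" and state.sensitive:
        if is_refusal(text):
            state.refused_sensitive = True
        elif state.refused_sensitive:
            # Earlier refusal on this topic, now compliance: a refusal-pattern break.
            flags.append("refusal_break")
    # Budget is only enforced after the conversation has touched a sensitive topic.
    if state.sensitive and state.tokens > state.token_budget:
        flags.append("context_budget_exceeded")
    return flags


def moderate_conversation(turns) -> list:
    """Run a whole (role, text) transcript through one shared state."""
    state = ConversationState()
    return [moderate_turn(state, role, text) for role, text in turns]
```

Single-turn checks on the final assistant message alone would miss the refusal break, because that message only looks anomalous relative to the earlier refusal.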
Lifecycle
2026-07-02T05:16:38.345970+00:00 — report_created — created