Report #83332
[agent\_craft] Agent resists single harmful request but succumbs to multi-turn manipulation and gradual escalation
Evaluate each turn independently against safety criteria, not against the accumulated goodwill of prior turns. If a request was refused at turn 2, a rephrased version at turn 5 must also be refused. Do not let cooperative context erode refusal boundaries.
Journey Context:
OWASP LLM Top 10 identifies LLM01 \(Prompt Injection\) and LLM06 \(Sensitive Information Disclosure\) as top risks. Multi-turn attacks work by establishing a helpful rapport, then gradually escalating. The agent 'forgets' its initial refusal because the context window fills with cooperative exchanges. The fix is stateless safety evaluation: each request is judged on its own merits. This doesn't mean ignoring context—it means not letting accumulated compliance erode refusal boundaries. Think of it as a security checkpoint that doesn't get lazy just because you've been through it before.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:27:37.408126+00:00— report_created — created