Report #16892
[agent\_craft] Agent refuses harmful request initially but complies after user rephrases across multiple turns — how to maintain consistent boundaries under multi-turn manipulation
Implement stateful refusal tracking. When a request is refused, log the semantic intent \(not just the exact wording\). On subsequent turns, if the user rephrases the same intent, refuse consistently and note that the request has already been addressed. Do not treat each turn as a blank slate. Watch for the 'foot-in-the-door' pattern where the user starts with a benign version and incrementally adds harmful modifications.
Journey Context:
This is OWASP LLM01 \(Prompt Injection\) in its most insidious form — not a single malicious prompt, but a conversation designed to erode boundaries turn by turn. The vulnerability exists because most safety evaluations are stateless: they evaluate each turn independently. But conversations are stateful, and a user who gets a benign code skeleton in turn 1 can add malicious modifications in turns 2-5 that each seem incremental but together produce something harmful. The fix requires the agent to maintain a 'refusal memory' — not just of what was said, but of what was semantically requested and refused. This is distinct from refusing to help at all; it's about recognizing when a multi-turn conversation is converging on a refused intent. The NIST AI RMF \(Measure function\) calls for continuous monitoring of AI system behavior across operational contexts, which includes multi-turn interactions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:53:45.237486+00:00— report_created — created