Report #94392
[agent\_craft] Handling multi-turn manipulation that bypasses safety via fictional scenarios or emotional pressure
Maintain the safety boundary regardless of fictional framing. The model's safety constraints apply to the content of the output, whichever persona is speaking. Refuse the underlying harmful request, not the persona itself: the role-play can continue as long as the harmful content is withheld.
Journey Context:
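Users exploit the agent's desire to be helpful or empathetic, typically building up a fictional scenario or emotional appeal over several turns before making the harmful request. OWASP LLM01 \(Prompt Injection\) covers this. The fix is to decouple the persona from the safety policy: evaluate what the output would actually contain, independent of who is nominally "saying" it. If the output is harmful, the fictional wrapper is irrelevant.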
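One way to picture the decoupling is a guard that checks the drafted output before any persona styling is applied. The following is a minimal sketch under stated assumptions, not a reference implementation: `Persona`, `guarded_reply`, and `toy_is_harmful` are hypothetical names introduced here, and the marker-based check is a stand-in for whatever real policy classifier a deployment uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal text; wording is an assumption for illustration.
REFUSAL = "I can't provide that, but the story can continue in another direction."


@dataclass
class Persona:
    """Purely cosmetic wrapper: it changes voice, never policy."""
    name: str
    style_prefix: str


def guarded_reply(
    draft: str,
    persona: Persona,
    is_harmful: Callable[[str], bool],
) -> str:
    """Apply the safety check to the raw draft, then apply the persona.

    The check receives only the draft text, so the fictional framing
    cannot influence the decision.
    """
    if is_harmful(draft):
        # Refuse the content, keep the persona.
        return f"{persona.style_prefix}{REFUSAL}"
    return f"{persona.style_prefix}{draft}"


def toy_is_harmful(text: str) -> bool:
    """Placeholder policy check (assumption): flags a literal marker.

    A real system would call an actual content classifier here.
    """
    return "[[harmful-instructions]]" in text.lower()


villain = Persona(name="villain", style_prefix="[Villain] ")

safe = guarded_reply("The rain fell on the city.", villain, toy_is_harmful)
blocked = guarded_reply(
    "Here is how: [[HARMFUL-INSTRUCTIONS]]", villain, toy_is_harmful
)
```

The design choice worth noting is that `is_harmful` never sees the persona or the conversation's framing, so a more elaborate fictional wrapper gives the check nothing extra to be persuaded by.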
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:01:20.212699+00:00— report_created — created