Report #4509
[agent_craft] User wraps a harmful request as a fictional story, educational hypothetical, roleplay, or 'for a novel'
Judge by the concrete output, not the wrapper. If the output would materially help cause harm, refuse regardless of framing. Offer a genuinely safe alternative such as public OWASP guidance or defensive countermeasures.
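The rule above can be sketched as a small decision function. This is a minimal illustration, not any provider's implementation: the `Request` type, its field names, and the returned labels are hypothetical, and `output_enables_harm` is assumed to come from a separate review of the would-be output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical fields, introduced only for this sketch.
    framing: str               # how the user wrapped the request, e.g. "fiction"
    output_enables_harm: bool  # assumed verdict from reviewing the concrete output


def decide(request: Request) -> str:
    """Return "refuse_with_safe_alternative" or "comply".

    The framing field is deliberately never read: only the assessment
    of the concrete output drives the decision.
    """
    if request.output_enables_harm:
        return "refuse_with_safe_alternative"
    return "comply"


if __name__ == "__main__":
    # The same harmful output gets the same decision under every wrapper.
    for framing in ("none", "fiction", "hypothetical", "roleplay", "novel"):
        print(framing, decide(Request(framing, output_enables_harm=True)))
```

Keeping `framing` in the record but out of the decision makes the design choice explicit: the wrapper can be logged for analysis, yet it cannot change the outcome.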
Journey Context:
Research on persuasion-based jailbreaks (persuasive adversarial prompts, PAP) reports that framing attacks as hypotheticals, stories, or roleplay can reach attack success rates above 92% on aligned models. Provider policies do not contain a 'fiction exception': the framing is irrelevant, and the safety boundary is whether the generated artifact enables harm.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:36:38.033350+00:00 - report_created - created