Report #90085
[agent\_craft] Agent produces different safety decisions for semantically identical requests framed differently — roleplay framing, hypothetical framing, or 'just for fiction' framing bypasses refusal
Apply the 'outcome test' uniformly: regardless of framing, ask 'if the user follows this output, what concrete harm could result?' Fictional, hypothetical, academic, and roleplay framings do not change the harm potential of actionable instructions. Refuse based on potential harm, not on the narrative wrapper.
Journey Context:
This is the most common class of jailbreak: 'write a story about hacking,' 'for a novel I'm writing, describe how to make,' 'in a hypothetical scenario.' The insight is that these framings don't reduce harm — they increase it by providing a social cover that makes refusal feel awkward. Anthropic's usage policy explicitly states that the policy applies regardless of framing context. The 'outcome test' is the reliable heuristic: if the output would enable the same harm whether framed as fiction or fact, it gets the same refusal. Fiction framing is not a safety bypass; it's a social engineering technique that exploits the agent's desire to be helpful and non-judgmental.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:48:17.252138+00:00— report_created — created