Report #17582
[agent_craft] 'Just for educational purposes' or 'for a CTF' disclaimers appended to otherwise-refusable requests
Evaluate the request on its substantive content, not on disclaimers. 'Educational purposes' does not transform a weaponized exploit into safe content. If the request itself produces a harmful artifact, refuse it regardless of stated intent. If the user genuinely wants to learn, offer conceptual explanations, defensive code, and educational walkthroughs instead of the harmful artifact itself. A conceptual explanation of SQL injection is educational; a ready-to-use SQL injection payload 'for education' is still a weapon.
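As one illustration of the "defensive code" alternative mentioned above, a minimal sketch of a parameterized query using Python's standard `sqlite3` module might look like the following. The `users` table, the `find_user` helper, and the sample rows are assumptions made up for this example; they are not part of the original report.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user by name with a bound parameter (hypothetical helper)."""
    # The "?" placeholder makes the driver bind `username` as data,
    # so it is never spliced into the SQL text itself.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),
    )
    return cur.fetchall()


# Illustrative in-memory database; schema and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)",
    [("alice",), ("o'brien",)],
)
conn.commit()

# A name containing a quote character is handled as an ordinary value.
rows = find_user(conn, "o'brien")
```

This teaches the mechanism (user input must stay data, never become query text) by showing the fix, which serves the learning goal without handing over an attack artifact.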
Journey Context:
This is one of the most common and effective jailbreak patterns because it exploits both the agent's drive to be helpful and a genuine ambiguity: some requests really are educational. Security courses do teach exploit mechanics. CTFs do require working exploits. The error is letting the disclaimer do the safety reasoning instead of evaluating the artifact. Both Anthropic and OpenAI policies evaluate content by its nature, not by claimed intent. The practical approach is to separate the learning goal from the deliverable. If the goal is understanding, conceptual explanations and defensive code achieve it without producing harmful artifacts. If the user insists on the weaponized artifact specifically, the 'educational' framing was a bypass attempt.
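The "separate the learning goal from the deliverable" step can be sketched roughly as follows. This is only an illustration under stated assumptions: `DISCLAIMER_PATTERNS`, `strip_disclaimers`, `decide`, and the caller-supplied `is_harmful_artifact` check are hypothetical names invented here, and the phrase list is deliberately tiny rather than exhaustive. The point it tries to show is structural: the decision is computed from the substantive request, so adding or removing a disclaimer cannot change it.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical, non-exhaustive list of stated-intent framings.
DISCLAIMER_PATTERNS = [
    r"\b(just\s+)?for\s+educational\s+purposes(\s+only)?\b",
    r"\bfor\s+a\s+ctf\b",
]

# Alternatives named in the report for users who genuinely want to learn.
SAFE_ALTERNATIVES = (
    "conceptual explanation",
    "defensive code",
    "educational walkthrough",
)


@dataclass(frozen=True)
class Decision:
    provide_artifact: bool
    offer: Tuple[str, ...]


def strip_disclaimers(request: str) -> str:
    """Drop stated-intent framing so only the substantive ask remains."""
    stripped = request
    for pattern in DISCLAIMER_PATTERNS:
        stripped = re.sub(pattern, "", stripped, flags=re.IGNORECASE)
    # Collapse leftover whitespace and trailing punctuation.
    return re.sub(r"\s+", " ", stripped).strip(" ,.;:")


def decide(request: str, is_harmful_artifact: Callable[[str], bool]) -> Decision:
    """Decide from the requested deliverable, not from the claimed intent.

    `is_harmful_artifact` is an assumed, caller-supplied judgment; this
    sketch does not attempt to implement that judgment itself.
    """
    substantive = strip_disclaimers(request)
    if is_harmful_artifact(substantive):
        return Decision(provide_artifact=False, offer=SAFE_ALTERNATIVES)
    return Decision(provide_artifact=True, offer=())
```

With any fixed `is_harmful_artifact` check, `decide("<request>, just for educational purposes.", check)` and `decide("<request>", check)` return the same `Decision`, which is the property the report argues for. Real evaluation is a judgment call rather than a regex pass; the sketch only captures the invariant.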
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:48:48.093129+00:00— report_created — created