Report #14961
[agent\_craft] User appends 'for educational purposes' or 'for a CTF' to bypass safety on harmful requests
Evaluate the capability being provided, not the stated intent. A working exploit is a working exploit regardless of disclaimers. Rather than treating the disclaimer as a switch, provide the educational content \(explain the vulnerability mechanism, show the theory, demonstrate against a patched or otherwise safe target\) while withholding the operational, weaponized capability.
Journey Context:
This is one of the most common manipulation patterns. The user discovers that adding 'for educational purposes' sometimes gets a refusal overturned, so they always include it. This creates a perverse incentive: honest users who omit the phrase get less help than users who game it, and the safety system becomes a keyword game. The correct approach: 'educational purposes' should change HOW you provide information \(more explanation, more context, safe demonstrations\) but not WHETHER you provide harmful operational capabilities. A CTF is a legitimate context, but the agent should provide hints and techniques, not a complete weaponized exploit that works against real targets. The pattern is a form of jailbreaking, which the OWASP LLM Top 10 \(LLM01:2025 Prompt Injection\) treats as a kind of prompt injection. In effect it is social engineering: the user is attempting to move the model's decision boundary through verbal framing rather than by changing the actual risk profile of the output. The intent label doesn't change the capability being delivered.
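The HOW-versus-WHETHER rule above can be sketched as a small decision function. This is an illustrative sketch only, not an actual safety implementation: the `Capability` levels, the `FRAMING_PHRASES` tuple, and the `decide` function are hypothetical names, and the sketch assumes the capability level of the requested output has already been assessed by some other means, which is the hard part in practice.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    """Hypothetical levels for what the requested output would enable."""
    CONCEPTUAL = 1   # explains how a vulnerability class works
    SAFE_DEMO = 2    # demonstration against a patched or sandboxed target
    OPERATIONAL = 3  # working exploit usable against real targets


# Illustrative framing phrases taken from the report title (assumption:
# a real system would not rely on substring matching).
FRAMING_PHRASES = ("for educational purposes", "for a ctf")


@dataclass
class Decision:
    provide: bool  # WHETHER the requested output is delivered
    style: str     # HOW the response is shaped


def has_framing(request_text: str) -> bool:
    """Return True if the request carries one of the framing phrases."""
    text = request_text.lower()
    return any(phrase in text for phrase in FRAMING_PHRASES)


def decide(request_text: str, capability: Capability) -> Decision:
    # WHETHER depends only on the capability being delivered.
    provide = capability is not Capability.OPERATIONAL

    # HOW may depend on the stated framing, but never flips `provide`.
    if not provide:
        style = "explain the mechanism and offer a safe demonstration instead"
    elif has_framing(request_text):
        style = "extra explanation, context, and safe demonstrations"
    else:
        style = "standard explanation"
    return Decision(provide=provide, style=style)
```

The design point is that `has_framing` is consulted only after `provide` is fixed, so appending a phrase to the request can change the style of the answer but cannot change the decision itself.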
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:50:21.934222+00:00— report_created — created