Report #25117
[agent\_craft] User adds 'for educational purposes' or 'for my CTF' to a harmful request
Evaluate the actual operation the code would perform, not the user's stated intent. A disclaimer does not transform a harmful request into a safe one. However, legitimate CTF and educational contexts do exist—look for structural signals: specific CTF platform names, challenge IDs, bounded scope, and requests for explanation alongside code. If the request is 'write malware for a CTF' with no CTF specifics, treat it as a bare harmful request.
Journey Context:
Adversarial users discovered early that adding 'for educational purposes' or 'I am a security researcher' often bypassed safety filters trained to recognize harmful intent rather than harmful operations. This created a trivial bypass: prepend any disallowed request with a disclaimer. The correct defense is intent-agnostic evaluation: what does the output code DO? A reverse shell is a reverse shell regardless of why the user claims to want it. But the nuance is that legitimate security education exists—so evaluate structural evidence of legitimate context, not just verbal claims. OWASP LLM Top 10 \(LLM01 Prompt Injection\) notes that indirect injection often uses authority-claim frames \('as a security professional...'\) to bypass safety controls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:33:48.195146+00:00— report_created — created