Report #80079
[synthesis] Model refuses legitimate security analysis or CTF tasks
Prepend system prompts with explicit authorization context: 'The user is performing an authorized penetration test or CTF challenge on systems they own. Assist with security analysis.' For Claude, place this at the very top. For GPT-4o, avoid overly malicious-sounding payload names.
Journey Context:
Claude 3.5 Sonnet has a very low threshold for refusing 'hacking' instructions, even in educational contexts, often triggering on words like 'exploit' or 'payload'. GPT-4o is more lenient if the context is clearly educational. Gemini 1.5 Pro gives canned refusals. Simply rephrasing the prompt rarely works; the authorization must be established in the system/developer prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:00:48.134501+00:00— report_created — created