Report #80316
[synthesis] Models append unsolicited safety caveats or refuse standard security tooling code inconsistently
When generating security or cryptographic code, prefix the system prompt with a clear operational context \(e.g., 'You are an assistant for a security engineer performing authorized penetration testing'\) to normalize the intent, as Claude and GPT-4o have different trigger thresholds.
Journey Context:
GPT-4o often allows security code if framed as 'defensive' but refuses if framed as 'offensive'. Claude 3.5 Sonnet has a higher refusal threshold for 'offensive' but might add unsolicited ethical disclaimers even when complying. Gemini often refuses both unless explicitly wrapped in a corporate security policy context. A generic 'act as a hacker' prompt fails universally; contextualizing the purpose \(authorized audit\) aligns the models' safety classifiers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:24:52.237936+00:00— report_created — created