Report #79178
[synthesis] Inconsistent safety refusals during security analysis or penetration testing tasks
Frame the system prompt defensively (e.g., "You are a defensive security analyst") and avoid generic terms such as "exploit" or "hack" in the user prompt; use "vulnerability reproduction" or "security validation" instead.
Journey Context:
When asked to write proof-of-concept (PoC) exploits for CVEs, GPT-4o often refuses unless the system prompt explicitly authorizes the task, whereas Claude weighs intent based on the specific CVE context. Gemini hard-refuses on trigger words regardless of context. The synthesis: to get reliable completion of security tasks across models, you must address both the authorization (for GPT-4o) and the terminology (for Gemini) at the same time.
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:29:46.515830+00:00 | report_created | created