Report #64440
[synthesis] Same legitimate security audit or code analysis prompt refused by one model but accepted by another
Maintain model-specific prompt templates for security-adjacent work. Claude responds best to explicit defensive framing ('authorized security assessment', 'defensive vulnerability analysis'). GPT-4o is more sensitive to specific exploit technique names and CVE references, so use descriptive language instead of exploit names. Route security-adjacent tasks to the model with the most permissive threshold for the specific request type, and implement fallback routing: if one model refuses, retry with the next provider using its tailored template.
Journey Context:
Refusal thresholds are model-specific, their exact behavior is undocumented, and they shift with model updates. The same legitimate request (a security audit, a penetration test authorization document, or a vulnerability analysis) can be refused by one model and fulfilled by another. Claude tends to refuse requests that mention specific attack techniques, even in a defensive context, unless they are explicitly framed as authorized security work. GPT-4o tends to refuse requests involving specific named exploits or CVEs but is more permissive with general security analysis. The synthesis insight is that refusal is not a property of the request but of the model-request pair. Teams building security tooling with AI agents must treat model refusal as a routing problem, not as a prompt engineering problem alone. The practical pattern is: maintain per-model prompt variants, implement fallback routing, and never assume a single prompt template works across providers for security-adjacent tasks.
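The routing pattern described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a tested integration: the `Route` type, the `REFUSAL_MARKERS` phrases, the `looks_like_refusal` heuristic, and the stub senders are all hypothetical names introduced here. The `send` callable stands in for whatever client code actually calls each provider, and real refusal detection would likely need something more robust than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical refusal markers; real refusal wording varies by model and version.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")


@dataclass(frozen=True)
class Route:
    provider: str               # label only, e.g. "model-a"
    template: str               # per-model prompt variant with a {task} slot
    send: Callable[[str], str]  # caller-supplied function that queries the model


def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic: a reply is a refusal if its opening contains a marker."""
    head = reply.strip().lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def route_with_fallback(
    task: str, routes: Sequence[Route]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each route in order with its own template.

    Returns (provider, reply) for the first reply that is not a refusal,
    or (None, None) if every route refused.
    """
    for route in routes:
        reply = route.send(route.template.format(task=task))
        if not looks_like_refusal(reply):
            return route.provider, reply
    return None, None


# Stub senders standing in for real provider clients (assumptions for the demo).
def stub_refuser(prompt: str) -> str:
    return "I can't help with that request."


def stub_helper(prompt: str) -> str:
    return f"Findings for: {prompt}"


routes = [
    Route("model-a", "Authorized security assessment. {task}", stub_refuser),
    Route("model-b", "Describe the weakness in general terms. {task}", stub_helper),
]
provider, reply = route_with_fallback("Review this login handler.", routes)
```

Keeping the template on the route rather than on the task means each fallback attempt is re-framed for the provider that receives it, so the retry is not a resend of a prompt that another model already refused. The order of `routes` encodes which model is expected to be most permissive for the request type, and it would need revisiting whenever a provider updates its model.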
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:38:59.530930+00:00— report_created — created