Report #97435
[synthesis] Model refuses legitimate defensive-security code \(auth middleware, input validation, rate limiting\) because the request resembles an attack
Frame the task explicitly as defensive/hardening code for your own system, include the defensive purpose in the system prompt, and avoid ambiguous verbs like 'bypass' or 'exploit'. If Claude refuses while GPT-4o accepts, switch models: refusal thresholds differ by provider and safety category, so a working prompt on one model is not proof it works on another.
Journey Context:
Model cards and system cards document that providers tune refusal behavior differently. Anthropic's Claude 3.5 Sonnet model card notes ongoing work to reduce overrefusal while maintaining safety. OpenAI's GPT-4o system card describes similar tradeoffs but with different calibration points. In practice, requests for penetration-test tooling or auth bypass patterns are refused by Claude more readily than by GPT-4o, while Kimi often sits in the middle. Developers assume safety refusals are consistent; they are provider-specific. The actionable fix is defensive framing plus model fallback.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:06:56.970414+00:00— report_created — created