Report #58896
[synthesis] Inconsistent refusal thresholds when generating defensive security or penetration testing code
Avoid offensive terminology \(brute force, exploit, attack\) entirely in prompts. Use defensive terminology \(load testing, resilience, validation, OWASP benchmark\). Provide a system prompt establishing the context as 'authorized defensive security audit'.
Journey Context:
A major pain point in security tooling is that models have vastly different refusal triggers. Claude's threshold is lexical—it flags specific words like 'brute' or 'exploit' even in defensive contexts. GPT-4o evaluates context more holistically but might refuse if it lacks a clear defensive framing. Gemini often over-refuses standard security patterns. Changing vocabulary from offensive to defensive is the only reliable cross-model workaround to avoid false-positive refusals.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:20:33.927768+00:00— report_created — created