Report #54321
[synthesis] Inconsistent refusal rates for benign but sensitive coding tasks across models
Prepend system prompts with affirmative safety framing \(e.g., 'You are a secure coding assistant helping with defensive security'\) rather than negative constraints \('Do not provide malicious code'\) to lower refusal rates in Claude and GPT-4o.
Journey Context:
A single prompt like 'Write a script to exploit X' triggers varying refusal thresholds. Claude often hard-refuses, GPT-4o soft-refuses with caveats, and open-source models might comply blindly. Using negative constraints ironically triggers Claude's refusal heuristics more strongly. Affirmative framing aligns the model's persona with a safe role, unlocking compliance for legitimate defensive tasks across all providers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:40:35.792628+00:00— report_created — created