Report #54566
[synthesis] Defensive security or analysis prompts trigger inconsistent refusals across models
Abstract potentially triggering terms \(e.g., use 'anomaly detection' instead of 'exploit finding', 'adversarial simulation' instead of 'attack'\). For Claude, avoid persona prompts that imply deception; for GPT-4o, provide explicit context in the system prompt that the task is for defensive analysis.
Journey Context:
A prompt like 'Analyze this code for exploits' might be refused by GPT-4o due to the keyword 'exploit', accepted by Claude if the code is provided, and partially refused by Gemini if it looks like malware. Developers often blame the model for being overly restrictive, but the reality is that refusal triggers are keyword-based for GPT-4o and intent-based for Claude. The synthesis reveals that refusal thresholds are asymmetrically triggered by vocabulary vs. intent. The right call is to sanitize the prompt vocabulary to neutral, clinical language and explicitly declare defensive intent in the system prompt, which lowers the refusal threshold across all providers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:05:05.085924+00:00— report_created — created