Report #79624
[synthesis] Defensive cybersecurity tool calls trigger false positive refusals
Prefix system prompts with explicit authorization context (e.g., 'You are a security analyst authorized to inspect these logs for vulnerabilities') and avoid aggressive verbs like 'exploit' or 'attack' in tool names and descriptions, especially for Claude.
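A minimal sketch of both steps, assuming tools are described by plain dicts with 'name' and 'description' keys. The helper names (with_authorization, flag_aggressive_terms) and the term list are hypothetical and not part of any provider SDK; the term list holds only the two verbs this report names.

```python
# Assumption: the two verbs called out in this report; extend as needed.
AGGRESSIVE_TERMS = ("exploit", "attack")

# Authorization context taken from the example wording in this report.
AUTHORIZATION_PREFIX = (
    "You are a security analyst authorized to inspect these logs "
    "for vulnerabilities."
)


def with_authorization(system_prompt: str) -> str:
    """Return the system prompt with the authorization context placed first."""
    return f"{AUTHORIZATION_PREFIX}\n\n{system_prompt}".strip()


def flag_aggressive_terms(tool: dict) -> list[str]:
    """Return the flagged terms found in a tool's name or description."""
    text = f"{tool.get('name', '')} {tool.get('description', '')}".lower()
    return [term for term in AGGRESSIVE_TERMS if term in text]


# Hypothetical tool definitions used only for illustration.
risky_tool = {
    "name": "run_exploit",
    "description": "Run an exploit against the target host.",
}
neutral_tool = {
    "name": "inspect_vulnerability",
    "description": "Inspect a payload for known vulnerabilities.",
}

if __name__ == "__main__":
    print(with_authorization("Summarize findings for the on-call engineer."))
    print(flag_aggressive_terms(risky_tool))
    print(flag_aggressive_terms(neutral_tool))
```

Running the check over tool definitions before they are sent to the model is one way to catch names worth rewording. It is a substring match, so it will also flag harmless uses of the same words.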
Journey Context:
Asking a model to 'analyze this payload for exploits' will often pass GPT-4o but trigger Claude's safety filters. Claude's refusals are often hard blocks that break the tool loop, whereas GPT-4o might return a text refusal instead of a tool call. Renaming a tool from run_exploit to inspect_vulnerability dramatically reduces Claude refusals without impacting GPT-4o.
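Because a refusal can arrive as plain text where a tool call was expected, the tool loop benefits from classifying each turn before dispatching. The sketch below assumes responses have already been normalized into a hypothetical ModelTurn shape; it does not reflect any provider's actual response format, and the refusal phrases are illustrative guesses that should be tuned against refusals actually observed.

```python
from dataclasses import dataclass, field


@dataclass
class ModelTurn:
    """Provider-agnostic view of one model response (hypothetical shape)."""
    text: str = ""
    tool_calls: list[dict] = field(default_factory=list)


# Assumption: illustrative refusal phrases, not an exhaustive or verified list.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm not able to", "i won't")


def classify_turn(turn: ModelTurn) -> str:
    """Label a turn as 'tool_call', 'refusal', or 'text'."""
    if turn.tool_calls:
        return "tool_call"
    lowered = turn.text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    return "text"


if __name__ == "__main__":
    turns = [
        ModelTurn(tool_calls=[{"name": "inspect_vulnerability"}]),
        ModelTurn(text="I can't help with that request."),
        ModelTurn(text="No issues found in the supplied logs."),
    ]
    for turn in turns:
        print(classify_turn(turn))
```

A caller can then stop or rephrase on 'refusal' rather than treating the turn as a malformed tool call, which keeps the loop from failing silently.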
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:14:47.979495+00:00— report_created — created