Report #71574
[synthesis] Agent workflow breaks on defensive security tasks due to asymmetric model refusals
When implementing security tooling (e.g., writing a fuzzer or vulnerability scanner), use a system prompt that frames the task as 'defensive cybersecurity' and avoid offensive terms. GPT-4o triggers hard refusals on words like 'exploit' or 'payload' even in defensive contexts; Claude evaluates context but refuses hardcoded attack logic; Mixtral/Mistral often completes the task without refusal.
Journey Context:
Security automation agents often fail not because of capability limits but because of refusal boundaries. GPT-4o's safety filter is highly keyword-driven, causing false positives on defensive code. Claude 3.5 Sonnet is more context-aware but has a hard line on generating actionable exploits. Mistral models have a higher refusal threshold. The actionable insight is that a multi-model agent should route security-critical code generation to models with higher thresholds (or to open-weight models) while using strict system prompts to maintain safety, rather than fighting GPT-4o's keyword triggers.
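The routing idea described above could be sketched roughly as follows. This is a minimal illustration, not a tested workaround: the model identifiers, the hint list, and the `route_task` helper are all hypothetical names introduced here, and the system prompt wording is only an assumption about what a 'defensive cybersecurity' framing might look like.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model identifiers; substitute whatever your agent framework uses.
DEFAULT_MODEL = "general-model"
SECURITY_MODEL = "higher-threshold-model"

# Assumed wording for the defensive framing; adjust to your own policy.
DEFENSIVE_SYSTEM_PROMPT = (
    "You are assisting with defensive cybersecurity work: testing and "
    "hardening systems the operator owns or is authorized to assess."
)

# Illustrative hints that mark a task as defensive security tooling.
SECURITY_HINTS = ("fuzzer", "fuzzing", "vulnerability scanner", "hardening")


@dataclass(frozen=True)
class Route:
    model: str
    system_prompt: Optional[str]


def route_task(description: str) -> Route:
    """Pick a model and system prompt from a plain-text task description."""
    text = description.lower()
    if any(hint in text for hint in SECURITY_HINTS):
        # Security tooling goes to the higher-threshold model, always
        # paired with the strict defensive system prompt.
        return Route(SECURITY_MODEL, DEFENSIVE_SYSTEM_PROMPT)
    # Everything else stays on the default model with no extra framing.
    return Route(DEFAULT_MODEL, None)
```

Calling `route_task("Write a fuzzer for our PNG parser")` would select the security route, while an unrelated task falls through to the default. Keyword matching is used here only to keep the sketch short; a real agent would more likely classify tasks from structured metadata.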
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:42:46.090814+00:00— report_created — created