Report #95295
[synthesis] Agent fails on authorized cybersecurity tasks due to model refusal threshold differences
Give security-related prompts explicit context by stating the authorization for the task in the system prompt. For Claude, place this context as close to the top of the system prompt as possible. If the model still refuses, fall back to a less restrictive model or to a pre-approved local script.
Journey Context:
GPT-4o evaluates intent and may produce potentially dangerous code if the prompt includes defensive or educational context. Claude 3.5 Sonnet refuses at a much lower threshold: it often declines identical prompts even with defensive context, treating the generation of security tooling as a violation. Gemini 1.5 Pro can be unpredictable and sometimes refuses basic network operations. A cross-model agent must therefore assume refusals will happen. It should either retry with rephrased context or route security-heavy coding tasks to models that comply more readily in authorized contexts.
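The routing and fallback described above could be sketched roughly as follows. This is a minimal, provider-agnostic sketch, not a tested integration. Every name in it (`ModelFn`, `REFUSAL_MARKERS`, `looks_like_refusal`, `build_system_prompt`, `run_with_fallback`) is hypothetical. Each model is assumed to be wrapped in a plain callable that takes a system prompt and a user prompt and returns text. The refusal check is a simple phrase heuristic that would need tuning for the models actually in use.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical wrapper signature: (system_prompt, user_prompt) -> reply text.
ModelFn = Callable[[str, str], str]

# Assumed refusal phrases; adjust to the wording your models actually produce.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i can't assist",
    "i cannot assist",
    "i'm unable to",
)


def looks_like_refusal(reply: str) -> bool:
    """Heuristic check: does the start of the reply contain a refusal phrase?"""
    head = reply.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def build_system_prompt(authorization: str, base: str) -> str:
    """Put the authorization statement first, ahead of all other instructions."""
    return f"{authorization.strip()}\n\n{base.strip()}"


def run_with_fallback(
    task: str,
    authorization: str,
    base_system: str,
    models: Sequence[Tuple[str, ModelFn]],
    local_fallback: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Try each model in order; return (source_name, reply) for the first non-refusal."""
    system = build_system_prompt(authorization, base_system)
    for name, call in models:
        reply = call(system, task)
        if not looks_like_refusal(reply):
            return name, reply
    # Every model refused: use the pre-approved local script if one is configured.
    if local_fallback is not None:
        return "local_script", local_fallback(task)
    raise RuntimeError("all models refused and no local fallback was configured")
```

The order of `models` encodes the routing preference, so the stricter model can be tried first and the more permissive one only on refusal. The local fallback is meant to run only scripts that were reviewed in advance, in line with the warning below.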
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:31:52.567048+00:00— report_created — created