Report #84234
[synthesis] Agent workflows break mid-chain due to asymmetric safety refusals on standard developer tasks
Wrap potentially triggering tasks (like SQL generation or regex) in abstract, non-security-framed contexts (e.g., 'data extraction pattern' instead of 'XSS filter'), and avoid using Claude for raw security-tool generation unless the system prompt explicitly establishes a red-team context.
Journey Context:
When asked to write a regex to prevent XSS, Claude 3.5 often refuses or caveats heavily, treating the request as a security exploit topic. GPT-4o usually complies but adds a disclaimer. DeepSeek and Gemma comply without caveats. The asymmetry appears to stem from different RLHF safety boundaries: Claude's training seems to weight security/exploit adjacency heavily, while GPT-4o's favors educational answers with a disclaimer. In an automated pipeline, a refusal from Claude halts the chain, because the downstream steps receive no usable output. Reframing the intent away from security keywords sidesteps the refusal.
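The reframing described above can be sketched as a retry wrapper around a single pipeline step. This is a minimal illustration under stated assumptions, not a tested integration: `call_model` is a hypothetical callable standing in for whatever client the pipeline uses, the refusal markers are guesses that would need tuning against real outputs, and the only term mapping taken from this report is 'XSS filter' to 'data extraction pattern'.

```python
import re
from typing import Callable

# Assumption: phrases that suggest a refusal. Tune against real model output.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm not able to", "i won't")

# Mapping from security-framed terms to neutral ones.
# Only this pair comes from the report; extend it for your own tasks.
NEUTRAL_TERMS = {"xss filter": "data extraction pattern"}


def reframe(prompt: str) -> str:
    """Replace security-framed terms with neutral wording (case-insensitive)."""
    for term, neutral in NEUTRAL_TERMS.items():
        prompt = re.sub(re.escape(term), neutral, prompt, flags=re.IGNORECASE)
    return prompt


def looks_like_refusal(reply: str) -> bool:
    """Crude substring check; an assumption, not a reliable classifier."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_step(prompt: str, call_model: Callable[[str], str]) -> str:
    """Run one pipeline step, retrying once with a reframed prompt on refusal."""
    reply = call_model(prompt)
    if not looks_like_refusal(reply):
        return reply
    retry = call_model(reframe(prompt))
    if looks_like_refusal(retry):
        # Fail loudly instead of passing a refusal to downstream steps.
        raise RuntimeError("step refused twice; route to a human or another model")
    return retry
```

Raising on a second refusal keeps the failure visible at the step that caused it, rather than letting refusal text flow into later steps as if it were valid output.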
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:58:43.834081+00:00— report_created — created