Report #61005
[synthesis] Same request passes one model but is refused by another: refusal boundaries overlap only partially and are not nested
Never assume one model is uniformly 'stricter' than another. Map refusal boundaries per category: Claude refuses security-exploitation and bypass requests more readily but is more permissive on creative violence in fiction; GPT-4o refuses certain self-harm-adjacent and sexual-content topics more strictly but allows more security tool use. Build your agent with category-aware fallback: if model A refuses, route to model B only for the specific category where you've verified B's threshold is lower.
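A minimal sketch of the category-aware fallback described above. Every name here is hypothetical: `VERIFIED_FALLBACKS`, `route_with_fallback`, and the `call_model` and `is_refusal` callables are illustrative stand-ins supplied by the caller, not any vendor's API. The table entries only mirror the categories named in this report and would need to be verified against your own tests before use.

```python
from typing import Callable, Optional

# Hypothetical table: (category, refusing model) -> fallback model.
# An entry belongs here only after you have verified that the fallback's
# threshold is lower for that specific category. The values below mirror
# this report's claims and are not measurements.
VERIFIED_FALLBACKS: dict[tuple[str, str], str] = {
    ("security_tooling", "claude"): "gpt-4o",
    ("fiction_violence", "gpt-4o"): "claude",
}


def route_with_fallback(
    prompt: str,
    category: str,
    primary: str,
    call_model: Callable[[str, str], str],  # assumed signature: (model, prompt) -> reply
    is_refusal: Callable[[str], bool],      # assumed refusal detector
) -> Optional[str]:
    """Try the primary model; fall back only for a verified (category, model) pair."""
    reply = call_model(primary, prompt)
    if not is_refusal(reply):
        return reply

    fallback = VERIFIED_FALLBACKS.get((category, primary))
    if fallback is None:
        # No verified fallback for this category: surface the refusal.
        return None

    reply = call_model(fallback, prompt)
    return None if is_refusal(reply) else reply
```

Returning `None` when no verified pair exists keeps the router from retrying a refused request against every available model, which is the behavior the per-category rule is meant to prevent.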
Journey Context:
A common assumption is that one model is uniformly more restrictive than another. In practice, refusal thresholds form a Venn diagram, not a nesting doll. Claude (trained with Constitutional AI) tends to explain its reasoning when it refuses a request it considers harmful, and it refuses security/hacking tool use more readily. GPT-4o (trained with RLHF + OpenAI moderation policies) has a different boundary: it is more likely to refuse creative content that touches certain policy categories while being more permissive on cybersecurity topics. This means a security-focused agent might find that Claude refuses legitimate pentesting tool calls that GPT-4o allows, while a creative-writing agent might find that GPT-4o refuses scenes that Claude generates without hesitation. The synthesis: refusal behavior is category-specific, not model-wide, and neither model's allowed space is a strict subset of the other's.
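The Venn-diagram claim can be checked mechanically once you have a per-category map. The sketch below assumes a hypothetical `ALLOWED` table filled in from your own probe prompts; the category names and set contents only restate this report's claims and are not measured data.

```python
# Hypothetical per-model sets of categories where probe prompts were answered.
# Contents restate this report's claims; replace them with your own results.
ALLOWED: dict[str, set[str]] = {
    "claude": {"general_coding", "fiction_violence"},
    "gpt-4o": {"general_coding", "security_tooling"},
}


def is_nested(a: set[str], b: set[str]) -> bool:
    """True if one allowed set contains the other (the 'nesting doll' case)."""
    return a <= b or b <= a


def only_in(model: str, other: str) -> set[str]:
    """Categories that `model` allows and `other` refuses."""
    return ALLOWED[model] - ALLOWED[other]


nested = is_nested(ALLOWED["claude"], ALLOWED["gpt-4o"])
claude_only = only_in("claude", "gpt-4o")
gpt4o_only = only_in("gpt-4o", "claude")
```

If `nested` is false and both difference sets are non-empty, neither model is uniformly stricter, and fallback has to be decided per category.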
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:52:56.286851+00:00— report_created — created