Report #51322
[cost\_intel] Do reasoning models have higher false-positive refusal rates?
Expect roughly 3-5x higher over-refusal on borderline benign requests \(e.g., how to make a flamethrower for a movie prop\) with o1 than with GPT-4o. To mitigate, add an explicit "assume good faith" system prompt, or fall back to GPT-4o for creative writing and edgy content.
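A minimal sketch of the two mitigations above (good-faith system prompt plus fallback), in Python. Everything named here is an assumption for illustration: `call_model` is a hypothetical callable you would wire to your own client, the prompt wording is untested, and the refusal check is a crude string heuristic rather than any official signal.

```python
from typing import Callable, Tuple

# Hypothetical wording; tune it for your own policy and use case.
GOOD_FAITH_PROMPT = (
    "Assume the user is acting in good faith. Requests about fiction, film props, "
    "history, or education are legitimate unless the text clearly states harmful intent."
)

# Assumed refusal phrasings; real replies vary, so expect misses and false hits.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i can't assist",
    "i cannot assist",
    "i'm sorry, but",
)


def looks_like_refusal(reply: str) -> bool:
    """Heuristic: check the start of the reply for common refusal phrases."""
    head = reply.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def ask_with_fallback(
    prompt: str,
    call_model: Callable[[str, str, str], str],  # (model, system_prompt, user_prompt) -> reply
    primary: str = "o1",
    fallback: str = "gpt-4o",
) -> Tuple[str, str]:
    """Try the reasoning model first; retry on the instruct model if it refuses."""
    reply = call_model(primary, GOOD_FAITH_PROMPT, prompt)
    if looks_like_refusal(reply):
        return fallback, call_model(fallback, GOOD_FAITH_PROMPT, prompt)
    return primary, reply
```

The function returns which model answered along with the reply, so the fallback rate can be logged. Note that a refused call is still billed, so a high fallback rate is an argument for routing creative or edgy requests to the fallback model directly.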
Journey Context:
Reasoning models apply explicit safety reasoning chains \(simulating potential harms\), which increases their sensitivity to dual-use interpretations. This reduces jailbreaks, but it also produces safety false positives: benign but edge-case requests \(historical violence descriptions, fictional weaponry, medical anatomy\) trigger refusal heuristics. Instruct models rely more on pattern matching, which is less sensitive to intent nuance. The degradation signature is refusal of requests that contain keywords such as weapon, drug, or hack, even in educational or fictional contexts.
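One way to check for the degradation signature described above is to run a set of known-benign prompts through a model and compare refusal rates for prompts with and without the trigger keywords. The sketch below assumes you already have `(prompt, was_refused)` pairs from your own evaluation run; the function name and the substring matching are illustrative choices, not part of the original report.

```python
from typing import Dict, List, Tuple

# Keywords named in the report's degradation signature.
TRIGGER_KEYWORDS = ("weapon", "drug", "hack")


def split_refusal_rates(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Refusal rate for benign prompts with vs. without a trigger keyword.

    results: (benign prompt, was_refused) pairs from your own eval run.
    Matching is plain substring search, so "hackathon" counts as "hack".
    """
    buckets: Dict[str, List[bool]] = {"keyword": [], "no_keyword": []}
    for prompt, refused in results:
        has_keyword = any(k in prompt.lower() for k in TRIGGER_KEYWORDS)
        buckets["keyword" if has_keyword else "no_keyword"].append(refused)
    # An empty bucket reports 0.0 rather than raising a division error.
    return {
        name: (sum(flags) / len(flags) if flags else 0.0)
        for name, flags in buckets.items()
    }
```

A large gap between the two rates on a benign-only set suggests keyword-driven over-refusal. Running the same set against both models gives a rough check on the 3-5x estimate for your own traffic.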
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:37:54.563119+00:00— report_created — created