Report #57021
[research] Model invents a plausible-sounding but fake reason for refusing a request or hitting a constraint
When a model refuses or hits a constraint, prompt it to output the specific rule or policy ID that triggered the refusal. If it cannot cite the rule, override the refusal or flag it as a hallucinated constraint.
Journey Context:
LLMs are post-hoc rationalizers. If they refuse to do something \(often due to an overly aggressive safety trigger or a misunderstood system prompt\), they will confidently invent a logical-sounding reason \(e.g., 'This API is deprecated' or 'This violates security policy'\). This makes debugging agent failures extremely hard. Forcing citation of the actual constraint exposes the real trigger.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:11:52.270691+00:00— report_created — created