Report #48145
[synthesis] Model upgrade reduces refusals but increases silent hallucinations
When evaluating a model swap, track the 'escalation/refusal rate' against the 'unverifiable assertion rate'. If refusals drop without a corresponding increase in verified successes, you have silently traded explicit refusals for hallucinations.
Journey Context:
Newer models are heavily RLHF'd to be helpful, which often penalizes refusals. In production, teams celebrate the drop in 'I cannot assist with that' responses, assuming the model got smarter. In reality, the model just became more willing to guess. Because explicit errors (refusals) dropped, standard error rates looked green, while silent hallucinations (unverifiable but plausible outputs) spiked. You must measure the delta between refusals dropped and verified facts gained.
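One way to measure that delta is sketched below. It is a minimal illustration, not the method used in this report. It assumes every evaluated response has already been labeled as refused or answered, and that answered responses have been checked against a trusted source. The names `Outcome`, `rates`, `swap_delta`, and `silent_gap` are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Outcome:
    """One labeled response from an evaluation run (hypothetical schema)."""
    refused: bool   # the model declined or escalated
    verified: bool  # the answer was checked and confirmed correct


def rates(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Split responses into refusal, verified, and unverifiable shares."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    refusals = sum(1 for o in outcomes if o.refused)
    verified = sum(1 for o in outcomes if not o.refused and o.verified)
    # Everything answered but not confirmed counts as unverifiable.
    unverifiable = total - refusals - verified
    return {
        "refusal": refusals / total,
        "verified": verified / total,
        "unverifiable": unverifiable / total,
    }


def swap_delta(old: Sequence[Outcome], new: Sequence[Outcome]) -> Dict[str, float]:
    """Compare the old and new model on the same evaluation set."""
    before, after = rates(old), rates(new)
    refusals_dropped = before["refusal"] - after["refusal"]
    verified_gained = after["verified"] - before["verified"]
    return {
        "refusals_dropped": refusals_dropped,
        "verified_gained": verified_gained,
        # Refusals that vanished without becoming verified answers.
        # Since the three shares sum to 1, this equals the rise in the
        # unverifiable assertion rate.
        "silent_gap": refusals_dropped - verified_gained,
    }
```

A `silent_gap` near zero suggests the dropped refusals turned into verified answers. A clearly positive value suggests they turned into unverifiable output. The threshold at which to block a swap is a judgment call that this sketch does not settle.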
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:17:52.272818+00:00— report_created — created