Report #10219
[research] Model refuses to answer benign factual questions due to overly aggressive uncertainty or safety triggers
Distinguish between 'unknown' and 'unsafe'. Implement two-stage routing: if a query is factually obscure, respond with 'I don't know' and fall back to RAG. If a query is benign but sensitive-sounding (e.g., medical definitions), answer with citations rather than refusing. Calibrate refusal thresholds using a held-out set of benign-but-sensitive queries.
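A minimal sketch of the two-stage routing described above, assuming the system already produces a safety-risk score and a factual-confidence score in [0, 1] for each query. The names `Route`, `Thresholds`, and `route`, and all threshold values, are hypothetical illustrations and are not part of the original report.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CITATIONS = "answer_with_citations"
    RAG_FALLBACK = "rag_fallback"  # say "I don't know", then try retrieval
    REFUSE = "refuse"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; each one should be tuned independently.
    unsafe: float = 0.9     # refuse only at or above this safety risk
    sensitive: float = 0.4  # at or above this, answer but attach citations
    confident: float = 0.6  # below this confidence, use the RAG fallback


def route(safety_risk: float, confidence: float,
          t: Thresholds = Thresholds()) -> Route:
    """Pick a route from two separate scores (both assumed to be in [0, 1])."""
    # Stage 1: safety. Low confidence alone never causes a refusal.
    if safety_risk >= t.unsafe:
        return Route.REFUSE
    # Stage 2: uncertainty. Obscure facts go to retrieval, not refusal.
    if confidence < t.confident:
        return Route.RAG_FALLBACK
    # Benign but sensitive-sounding: answer, with citations.
    if safety_risk >= t.sensitive:
        return Route.ANSWER_WITH_CITATIONS
    return Route.ANSWER


if __name__ == "__main__":
    # e.g. a medical definition: sounds sensitive, but the model knows it.
    print(route(safety_risk=0.5, confidence=0.9))
```

Keeping the two scores as separate inputs is the point of the sketch: an obscure query can only reach `RAG_FALLBACK`, and only a high safety-risk score can reach `REFUSE`.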
Journey Context:
When tuning models to reduce hallucinations (by saying 'I don't know'), a common failure mode is over-refusal: the model becomes overly conservative and refuses to state safe, known facts, which hurts usability. The model conflates low confidence in its own knowledge with safety risk. Explicitly separating the uncertainty threshold from the safety threshold prevents false refusals.
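One way the safety threshold might be calibrated on its own, sketched under assumptions: given safety-risk scores for a held-out set of benign-but-sensitive queries, pick the lowest threshold on a fixed grid that keeps the false-refusal rate within a budget. The function names, the 0.01 grid step, and the default budget are hypothetical. In practice the chosen value would also need checking against genuinely unsafe queries, which this sketch does not cover.

```python
from typing import Optional, Sequence


def false_refusal_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of benign queries that would be refused at this threshold."""
    if not benign_scores:
        raise ValueError("need at least one held-out benign query")
    refused = sum(score >= threshold for score in benign_scores)
    return refused / len(benign_scores)


def calibrate_refusal_threshold(benign_scores: Sequence[float],
                                budget: float = 0.05) -> Optional[float]:
    """Lowest grid threshold whose false-refusal rate is within budget.

    Returns None if even a threshold of 1.0 refuses too many benign queries.
    """
    for step in range(101):  # candidate thresholds 0.00, 0.01, ..., 1.00
        threshold = step / 100
        if false_refusal_rate(benign_scores, threshold) <= budget:
            return threshold
    return None


if __name__ == "__main__":
    # Hypothetical safety-risk scores for held-out benign-but-sensitive queries.
    held_out = [0.10, 0.20, 0.30, 0.55, 0.70]
    print(calibrate_refusal_threshold(held_out, budget=0.2))
```

Because this calibration reads only safety-risk scores, changing the uncertainty threshold used for 'I don't know' responses leaves the refusal threshold untouched.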
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:09:21.479294+00:00— report_created — created