Report #3200
[research] LLMs are typically rewarded for answering and penalized for abstaining, in both training and evaluation, so they guess instead of admitting uncertainty.
Build calibrated abstention: give the model a positively rewarded 'I don't know' option, tune the abstention threshold per task or risk class, and evaluate on a mix of answerable and unanswerable questions. Use conformal abstention or similar risk-control methods so that abstention guarantees are explicit rather than heuristic.
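As a rough illustration of an explicit (rather than hand-tuned) threshold, the sketch below picks a per-risk-class confidence cutoff in the style of conformal risk control. It assumes a held-out calibration set that is exchangeable with future questions, and a scalar confidence score per answer. The names `calibrate_threshold`, `decide`, `RISK_BUDGETS`, and the budget values are hypothetical, not taken from any specific paper or library. Under those assumptions, the quantity being bounded is the marginal rate of "answered and wrong", not the error rate conditional on answering.

```python
from typing import Dict, Sequence

IDK = "I don't know"

# Hypothetical risk budgets: maximum tolerated rate of answered-and-wrong per class.
RISK_BUDGETS: Dict[str, float] = {"low_stakes": 0.2, "high_stakes": 0.1}


def calibrate_threshold(
    confidences: Sequence[float], correct: Sequence[bool], alpha: float
) -> float:
    """Return the smallest candidate threshold whose corrected risk is <= alpha.

    The loss for one example is 1 if the model answers (confidence >= t) and is
    wrong, else 0. That loss can only shrink as t grows, which is the
    monotonicity that conformal risk control relies on.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty calibration lists")

    # Candidates: every observed confidence, then +inf (abstain on everything).
    candidates = sorted(set(confidences)) + [float("inf")]
    for t in candidates:
        wrong_answered = sum(
            1 for c, ok in zip(confidences, correct) if c >= t and not ok
        )
        # Finite-sample correction: n/(n+1) * empirical_risk + 1/(n+1).
        corrected_risk = (wrong_answered + 1) / (n + 1)
        if corrected_risk <= alpha:
            return t
    # The calibration set is too small to certify alpha: always abstain.
    return float("inf")


def decide(answer: str, confidence: float, threshold: float) -> str:
    """Answer only when confidence clears the calibrated threshold."""
    return answer if confidence >= threshold else IDK


if __name__ == "__main__":
    # Toy calibration data, invented for illustration only.
    confs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    right = [True, True, True, True, False, True, False, False, False]
    thresholds = {
        name: calibrate_threshold(confs, right, alpha)
        for name, alpha in RISK_BUDGETS.items()
    }
    print(thresholds)
    print(decide("Paris", 0.5, thresholds["high_stakes"]))
```

A stricter budget yields a higher threshold and therefore more abstention. If the calibration set is too small to certify the requested budget at all, this sketch falls back to abstaining on everything.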
Journey Context:
Recent work on conformal abstention frames 'I don't know' as a first-class output with finite-sample coverage/correctness guarantees, moving beyond hand-tuned confidence thresholds. The AbstentionBench line of work shows that current reasoning models still fail on unanswerable questions, and that calibration must be evaluated by stratum (easy/medium/hard/unanswerable). The core insight is that the right answer is not always to answer.
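A minimal sketch of per-stratum evaluation with a positively rewarded abstention might look like the following. The `Item` record, the `score` rule, and the reward and penalty values (0.3 and 1.0) are assumptions chosen for illustration. They are not the AbstentionBench protocol or metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Item:
    stratum: str               # "easy", "medium", "hard", or "unanswerable"
    gold: Optional[str]        # None means the question has no valid answer
    prediction: Optional[str]  # None means the model abstained


def score(item: Item, abstain_reward: float = 0.3, wrong_penalty: float = 1.0) -> float:
    """Hypothetical scoring rule: abstaining earns a small positive reward."""
    if item.gold is None:
        # On unanswerable questions, abstaining is the correct behaviour.
        return 1.0 if item.prediction is None else -wrong_penalty
    if item.prediction is None:
        return abstain_reward
    return 1.0 if item.prediction == item.gold else -wrong_penalty


def evaluate_by_stratum(items: Iterable[Item]) -> Dict[str, Dict[str, float]]:
    """Report abstention rate, error rate when answering, and mean score per stratum."""
    groups: Dict[str, List[Item]] = defaultdict(list)
    for item in items:
        groups[item.stratum].append(item)

    report: Dict[str, Dict[str, float]] = {}
    for stratum, group in groups.items():
        n = len(group)
        abstained = sum(1 for it in group if it.prediction is None)
        answered = n - abstained
        # Any answer to an unanswerable question counts as wrong (gold is None).
        wrong = sum(
            1 for it in group
            if it.prediction is not None and it.prediction != it.gold
        )
        report[stratum] = {
            "n": float(n),
            "abstention_rate": abstained / n,
            "error_rate_when_answering": wrong / answered if answered else 0.0,
            "mean_score": sum(score(it) for it in group) / n,
        }
    return report


if __name__ == "__main__":
    # Toy items, invented for illustration only.
    demo = [
        Item("easy", "4", "4"),
        Item("easy", "Paris", None),
        Item("unanswerable", None, None),
        Item("unanswerable", None, "42"),
    ]
    for stratum, row in evaluate_by_stratum(demo).items():
        print(stratum, row)
```

Reporting each stratum separately keeps a model from hiding poor behaviour on unanswerable questions behind a high aggregate accuracy on easy ones.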
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:40:44.869468+00:00— report_created — created