Report #4453
[research] LLM answers confidently when it should admit uncertainty
Build explicit abstention logic: use a calibrated confidence score, an answerability classifier, or logprob-based selective prediction, and have the model defer when confidence falls below a tuned threshold. Reward 'I don't know' in evaluation, not just accuracy.
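A minimal sketch of the thresholded deferral described above, assuming a confidence score in [0, 1] is already available from some calibrated source (a fine-tuned calibration head, an answerability classifier, or similar). The names `Decision`, `decide`, and `tune_threshold` are hypothetical and not taken from any library; how the confidence is produced is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Decision:
    answer: Optional[str]  # None when the model defers
    confidence: float
    abstained: bool


def decide(answer: str, confidence: float, threshold: float) -> Decision:
    """Return the answer only if confidence reaches the threshold, else defer."""
    if confidence < threshold:
        return Decision(answer=None, confidence=confidence, abstained=True)
    return Decision(answer=answer, confidence=confidence, abstained=False)


def tune_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float,
) -> Optional[float]:
    """Pick the lowest threshold whose answered subset meets target_accuracy.

    Assumes `confidences` and `correct` come from a held-out labelled set.
    Returns None if no threshold reaches the target.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    n_correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        n_correct += int(ok)
        # Items with equal confidence are answered together, so only
        # evaluate a candidate threshold at the end of each tie group.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if n_correct / i >= target_accuracy:
            best = conf
    return best


# Hypothetical held-out scores, for illustration only.
held_out_conf = [0.95, 0.9, 0.8, 0.6, 0.4]
held_out_correct = [True, True, False, True, False]
threshold = tune_threshold(held_out_conf, held_out_correct, target_accuracy=0.75)
print(decide("use a mutex", 0.5, threshold))
```

Tuning on held-out data rather than picking a fixed cutoff keeps the threshold tied to an accuracy target, which is usually what the deployment cares about.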
Journey Context:
Standard right/wrong benchmarks score an abstention the same as a wrong answer, so the rational behavior for a model is to guess. Kapoor et al. show that zero-shot black-box uncertainty methods are ineffective or impractically expensive in open-ended generation, while fine-tuning for calibration produces reliable uncertainties that generalize across tasks and distribution shifts. The key insight is that token-level fluency does not equal answer-level correctness: a smooth paragraph can be stitched from high-probability tokens and still be wrong. In coding-agent contexts, a hallucinated fix is often worse than no fix, so measure accuracy@coverage (accuracy on the fraction of queries the model chooses to answer) and coverage@target-accuracy (the largest fraction it can answer while holding accuracy at a target) rather than raw accuracy. Calibrated abstention is a first-class feature, not a failure mode.
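The two selective-prediction metrics named above can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: examples are ranked by confidence with highest first, ties are broken arbitrarily by sort order, and the function names `accuracy_at_coverage` and `coverage_at_accuracy` are hypothetical.

```python
from typing import Sequence


def _ranked(confidences: Sequence[float], correct: Sequence[bool]):
    """Pair up scores and labels, sorted from most to least confident."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)


def accuracy_at_coverage(
    confidences: Sequence[float], correct: Sequence[bool], coverage: float
) -> float:
    """Accuracy on the most confident `coverage` fraction of examples."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ranked = _ranked(confidences, correct)
    # round() avoids off-by-one from float products; answer at least one example.
    k = max(1, round(coverage * len(ranked)))
    return sum(int(ok) for _, ok in ranked[:k]) / k


def coverage_at_accuracy(
    confidences: Sequence[float], correct: Sequence[bool], target_accuracy: float
) -> float:
    """Largest fraction answerable, most confident first, at >= target accuracy."""
    ranked = _ranked(confidences, correct)
    best_k = 0
    n_correct = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        n_correct += int(ok)
        if n_correct / k >= target_accuracy:
            best_k = k
    return best_k / len(ranked)


# Hypothetical evaluation data, for illustration only.
conf = [0.95, 0.9, 0.8, 0.6, 0.4]
ok = [True, True, False, True, False]
print(accuracy_at_coverage(conf, ok, coverage=0.4))
print(coverage_at_accuracy(conf, ok, target_accuracy=0.75))
```

Reporting both numbers next to raw accuracy makes it visible when a model gains accuracy only by guessing on questions it should have declined.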
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:31:35.210498+00:00— report_created — created