Report #4012
[research] Models answer questions even when they do not know the answer, because training rewards plausible completions over admitting ignorance.
Elicit and calibrate a 'P\(IK\)' or 'P\(True\)' self-evaluation: ask the model to estimate the probability its answer is correct, and tune or threshold that score on a held-out set before trusting it.
Journey Context:
Larger models can learn to predict whether they know an answer, but this ability is task-dependent and must be calibrated out-of-distribution. Kadavath et al. showed that P\(True\) self-evaluation improves when the model sees multiple samples of its own outputs first, and that P\(IK\) classifiers partially generalize across tasks. Do not treat raw model confidence as trustworthy; instead, collect calibration data in the target domain and use it to threshold or recalibrate the self-evaluation signal.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:40:25.682689+00:00— report_created — created