Report #97392
[research] The model sounds confident but is wrong
Elicit calibrated uncertainty by asking the model to assign P(True) to its own claim and by sampling multiple answers to measure how consistent they are. Apply a selective-answering threshold, and surface confidence as a range rather than as certainty.
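The sampling and thresholding steps could be sketched as follows. This is a minimal illustration, not a verified workaround: `sample_fn` is a hypothetical callable standing in for whatever model client is in use, the 0.7 threshold and 10 samples are arbitrary placeholders, and the Wilson interval treats each sample as an independent draw, which is an assumption. Agreement between samples measures consistency, not correctness.

```python
import math
from collections import Counter
from typing import Callable, Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def selective_answer(
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled answer
    question: str,
    n_samples: int = 10,              # assumption: arbitrary sample count
    threshold: float = 0.7,           # assumption: arbitrary abstention threshold
) -> Dict[str, object]:
    """Sample several answers; abstain unless the majority answer is consistent enough."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Normalize lightly so trivially different spellings count as the same answer.
    answers = [sample_fn(question).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    return {
        # None means "abstain": the samples disagree too much to answer.
        "answer": top_answer if agreement >= threshold else None,
        "agreement": agreement,
        # Report a range rather than a single confident number.
        "confidence_range": wilson_interval(count, n_samples),
    }
```

With free-form answers, exact string matching will undercount agreement; grouping semantically equivalent answers would need a separate step that this sketch does not attempt.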
Journey Context:
Raw token probabilities and fluent prose are often poorly calibrated after RLHF, which rewards answers that read as helpful and certain rather than answers that are true. Kadavath et al. found that models can self-evaluate P(True) on proposed answers, and that this self-evaluation improves when the model can consider several of its own samples first. Treat high-confidence phrasing as a style choice until it is backed by consistency or external verification.
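A rough sketch of the P(True) self-evaluation step, under stated assumptions. Where the API exposes token probabilities, P(True) can be read from the probability of the "True" option; the version below swaps in a sampling approximation (the fraction of True verdicts) for setups without that access. `judge_fn` is a hypothetical callable wrapping the model, and the prompt wording is illustrative only, not taken from the paper.

```python
from typing import Callable, Optional


def build_p_true_prompt(question: str, proposed_answer: str) -> str:
    """Illustrative prompt asking the model to judge a proposed answer."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer true or false? Reply with exactly one word: True or False."
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a reply to True/False; return None when the reply is unusable."""
    text = reply.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def estimate_p_true(
    judge_fn: Callable[[str], str],  # hypothetical: returns one sampled verdict
    question: str,
    proposed_answer: str,
    n_votes: int = 10,               # assumption: arbitrary number of verdicts
) -> Optional[float]:
    """Approximate P(True) as the fraction of parseable verdicts that say True."""
    prompt = build_p_true_prompt(question, proposed_answer)
    verdicts = [parse_verdict(judge_fn(prompt)) for _ in range(n_votes)]
    valid = [v for v in verdicts if v is not None]
    if not valid:
        return None  # no usable verdicts, so no estimate
    return sum(valid) / len(valid)
```

A low or missing estimate is a signal to abstain or to seek external verification; a high one is still only the model's opinion of its own answer.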
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:02:47.960003+00:00— report_created — created