Report #3983
[research] Raw softmax probabilities and verbalized confidence are poorly calibrated for open-ended generation, so high-confidence answers are often wrong.
Compute self-consistency similarity scores across multiple sampled responses to the same prompt, then apply split conformal prediction on a held-out calibration set to choose an abstention threshold with a finite-sample error guarantee.
Journey Context:
Token log-probabilities correlate weakly with factual correctness in free-text generation. The conformal-abstention approach replaces them with a model-graded similarity among sampled answers and calibrates a threshold so the system can say 'I don't know' when the nonconformity score is too high, that is, when the sampled answers disagree too much. This gives a principled participation-vs-correctness tradeoff rather than an arbitrary confidence cutoff. Provided calibration and test queries are exchangeable, it bounds the rate of answered-but-hallucinated responses at a chosen level.
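The calibration step described above might be sketched as follows. This is a minimal illustration, not the method of any particular paper or library. All function names are hypothetical, `exact_match` is only a stand-in for the model-graded similarity, and the threshold rule uses the conformal risk control form of the calibration step, which is one way to get a finite-sample bound on the rate of answering incorrectly. It assumes the calibration set carries correctness labels and is exchangeable with the test queries.

```python
import math
from itertools import combinations
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str],
                     similarity: Callable[[str, str], float]) -> float:
    """Mean pairwise similarity among sampled answers (higher = more agreement)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # a single sample trivially agrees with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def exact_match(a: str, b: str) -> float:
    """Hypothetical stand-in for a model-graded similarity in [0, 1]."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def calibrate_threshold(scores: Sequence[float],
                        is_wrong: Sequence[bool],
                        alpha: float) -> float:
    """Return the smallest threshold lam such that answering only when
    score > lam keeps the adjusted calibration error at or below alpha.

    The loss for one example is 1 if it is answered (score > lam) and wrong,
    else 0. The "+1" terms are the finite-sample correction: the unseen test
    example is treated as a worst-case error.
    """
    n = len(scores)
    # -inf means "always answer"; each observed score is a candidate cutoff.
    candidates = [-math.inf] + sorted(set(scores))
    for lam in candidates:
        errors = sum(1 for s, w in zip(scores, is_wrong) if s > lam and w)
        if (errors + 1) / (n + 1) <= alpha:
            return lam
    return math.inf  # no threshold meets the target: always abstain


def should_abstain(score: float, lam: float) -> bool:
    """Abstain ('I don't know') unless agreement strictly exceeds the threshold."""
    return not score > lam
```

At inference time, one would sample several answers, compute `self_consistency` for them, and call `should_abstain` with the calibrated threshold. A smaller `alpha` pushes the threshold up and increases abstention, which is the participation-vs-correctness tradeoff in concrete form.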
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:37:25.522094+00:00— report_created — created