Report #25288
[research] Confidently answering obscure or out-of-distribution coding questions instead of expressing uncertainty
Implement structural calibration by using token probabilities (the softmax of the model's logits) to trigger an abstention fallback when confidence drops below a threshold, rather than relying on prompting alone.
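A minimal sketch of the threshold-based fallback, assuming the serving stack can return per-token log-probabilities alongside the generated answer. The function names, the abstention message, and both threshold values are hypothetical and would need tuning on held-out data; nothing here is tied to a specific model API.

```python
import math
from typing import List

# Hypothetical fallback text returned when confidence is too low.
ABSTAIN_MESSAGE = "I'm not confident enough to answer this reliably."


def mean_token_probability(token_logprobs: List[float]) -> float:
    """Geometric mean of per-token probabilities (exp of the mean log-prob)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def min_token_probability(token_logprobs: List[float]) -> float:
    """Probability of the single least likely generated token."""
    if not token_logprobs:
        return 0.0
    return math.exp(min(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: List[float],
    mean_threshold: float = 0.8,  # assumed value, not from the report
    min_threshold: float = 0.1,   # assumed value, not from the report
) -> str:
    """Return the answer only if both confidence signals clear their thresholds."""
    mean_p = mean_token_probability(token_logprobs)
    min_p = min_token_probability(token_logprobs)
    if mean_p < mean_threshold or min_p < min_threshold:
        return ABSTAIN_MESSAGE
    return answer
```

Checking the minimum as well as the mean is a design choice: a long answer can average out one very unlikely token, such as an invented function name, that the mean alone would hide.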
Journey Context:
Simply prompting 'tell me if you don't know' is insufficient: RLHF trains models to be helpful, which biases them toward answering anyway. Better calibration requires either probing the model's output logits, where a low max-softmax probability tends to correlate with hallucination, or running multi-step self-consistency checks (sampling several answers and checking whether they agree).
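The self-consistency alternative can be sketched as below, for cases where logits are not exposed. `sample_fn` is a hypothetical callable that returns one sampled answer per call (for example, a model queried at non-zero temperature); the sample count and agreement threshold are assumed values, not figures from this report.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_fn: Callable[[str], str],
    question: str,
    n_samples: int = 5,          # assumed value
    min_agreement: float = 0.6,  # assumed value
) -> Optional[str]:
    """Sample several answers; return the majority answer, or None to abstain."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Light normalization so whitespace differences do not count as disagreement.
    answers = [sample_fn(question).strip() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return top_answer
    return None  # samples disagree too much: treat as "don't know"
```

Exact string matching is a simplification. For coding questions, comparing answers by behavior (for instance, whether the sampled snippets pass the same checks) is likely a more meaningful agreement test.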
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:50:56.528727+00:00— report_created — created