Report #14877
[research] LLM answers obscure or out-of-distribution questions with high confidence instead of refusing
Use token log-probabilities (logprobs) to compute an entropy or confidence score for the answer; set a threshold past which the model must output a refusal (e.g., 'I don't have enough information').
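The thresholding described above can be sketched as follows. This is a minimal illustration, not a verified implementation: it assumes the serving API returns, for each generated token, the logprob of the chosen token plus the logprobs of the top-k alternatives. The function names, the `REFUSAL` string, and the threshold values `max_entropy=1.0` and `min_confidence=0.5` are all hypothetical and would need tuning on held-out data. Entropy computed over only the top-k candidates is a renormalized approximation of the full-vocabulary entropy.

```python
import math

# Hypothetical refusal text; adjust to the product's wording.
REFUSAL = "I don't have enough information."


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) at one token position.

    `top_logprobs` holds the logprobs of the top-k candidates. They are
    renormalized to sum to 1, so this only approximates the entropy of
    the full vocabulary distribution.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(per_token_top_logprobs):
    """Average per-token entropy across the generated answer."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


def sequence_confidence(chosen_logprobs):
    """Geometric mean of the probabilities of the tokens actually emitted."""
    return math.exp(sum(chosen_logprobs) / len(chosen_logprobs))


def answer_or_refuse(answer, chosen_logprobs, per_token_top_logprobs,
                     max_entropy=1.0, min_confidence=0.5):
    """Return `answer` only if both signals clear their (assumed) thresholds."""
    if not chosen_logprobs or not per_token_top_logprobs:
        return REFUSAL  # no signal available: refuse rather than guess
    too_uncertain = mean_entropy(per_token_top_logprobs) > max_entropy
    too_unlikely = sequence_confidence(chosen_logprobs) < min_confidence
    return REFUSAL if (too_uncertain or too_unlikely) else answer


# Usage: a peaked distribution passes, a near-uniform one is refused.
confident_top = [[math.log(0.99), math.log(0.01)]] * 2
guessing_top = [[math.log(0.25)] * 4] * 2
print(answer_or_refuse("Paris", [math.log(0.99)] * 2, confident_top))
print(answer_or_refuse("Xyz", [math.log(0.25)] * 2, guessing_top))
```

Combining both signals with a logical OR makes the refusal boundary conservative; whether that trade-off is right depends on how costly a wrong answer is relative to an unnecessary refusal.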
Journey Context:
Standard RLHF tends to push models toward always providing an answer, which degrades calibration. Verbalized uncertainty ('I think maybe...') is often poorly calibrated and easily swayed by prompting. Logprob-based calibration is more complex to implement, but it provides a mathematically grounded signal for when the model is essentially guessing, which allows hard refusal boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:41:22.923193+00:00— report_created — created