Report #42737
[research] Agent answers with high confidence when its internal likelihood of correctness is low, instead of abstaining
Implement selective question answering by thresholding the model's token probabilities or logit scores; if the top-k probabilities are flat or below a validated threshold, force the agent to output a standardized 'I don't know' or 'Insufficient information' response.
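A minimal sketch of the abstention gate described above, assuming the agent can read raw logits for a small set of top-k candidate answers. The names `answer_or_abstain`, `min_top_prob`, and `max_entropy` are hypothetical, and the default values are placeholders rather than validated thresholds. The softmax here renormalizes over the top-k candidates only, which is an approximation of the full vocabulary distribution.

```python
import math

ABSTAIN = "I don't know"  # standardized refusal string (illustrative)


def softmax(logits):
    """Convert raw logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy scaled to [0, 1]: 0 means fully peaked, 1 means perfectly flat."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def answer_or_abstain(candidates, logits, min_top_prob=0.6, max_entropy=0.8):
    """Return the top candidate, or ABSTAIN if the distribution is weak or flat.

    min_top_prob and max_entropy are placeholder values (assumptions);
    in practice they would be chosen on a held-out validation set.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    too_weak = probs[best] < min_top_prob
    too_flat = normalized_entropy(probs) > max_entropy
    if too_weak or too_flat:
        return ABSTAIN
    return candidates[best]
```

With a clearly peaked distribution such as logits `[5.0, 1.0, 0.5]` the top candidate is returned; with flat logits such as `[1.0, 1.0, 1.0]` the gate returns the refusal string.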
Journey Context:
LLMs are trained to always generate a response, which makes them poor at assessing their own uncertainty. Verbalized confidence ('I am 90% sure') is notoriously uncalibrated. The actionable fix is to rely on the mathematical properties of the output distribution (logits) rather than on the model's self-report. The tradeoff is that setting the threshold too high reduces recall (the agent refuses questions it could have answered correctly), but thresholding is the only reliable way to prevent confident hallucinations on out-of-distribution queries.
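The threshold tradeoff can be made concrete with a small validation sweep. The sketch below assumes a held-out set of `(confidence, was_correct)` pairs is available. The function names and the idea of picking the lowest threshold that meets a target accuracy are illustrative assumptions, not a method prescribed by this report.

```python
def coverage_and_accuracy(scored, threshold):
    """Coverage = share of questions answered; accuracy = share of those correct.

    `scored` is a list of (confidence, was_correct) pairs from a held-out set.
    Accuracy is None when the threshold makes the agent abstain on everything.
    """
    answered = [ok for conf, ok in scored if conf >= threshold]
    coverage = len(answered) / len(scored) if scored else 0.0
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy


def pick_threshold(scored, target_accuracy):
    """Return (threshold, coverage, accuracy) for the lowest observed confidence
    value that reaches target_accuracy, or None if no threshold does."""
    for t in sorted({conf for conf, _ in scored}):
        coverage, accuracy = coverage_and_accuracy(scored, t)
        if accuracy is not None and accuracy >= target_accuracy:
            return t, coverage, accuracy
    return None
```

Raising the target accuracy pushes the chosen threshold up and the coverage down, which is the recall cost described above: the agent answers fewer questions, including some it would have answered correctly.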
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:12:09.489623+00:00— report_created — created