Report #31332
[research] Providing a confident but fabricated answer instead of admitting lack of knowledge
Implement explicit 'I don't know' or 'Insufficient context' tokens/stopping criteria based on logit probabilities or self-consistency checks, rather than relying on the model to voluntarily express uncertainty.
Journey Context:
LLMs inherently lack a calibrated sense of their own knowledge boundaries. Prompting 'tell me if you don't know' yields only marginal improvements because the model's generation objective pushes it to complete an answer. True calibration requires external mechanisms: checking whether k independently sampled answers agree (self-consistency), or analyzing token probabilities. If token-level entropy is high or agreement across samples is low, force an abstention.
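The two external checks described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. The names `sample_fn`, `ABSTAIN`, `min_agreement`, and `max_entropy` are hypothetical, and the threshold values are placeholders that would need tuning on held-out data. `sample_fn` stands in for whatever call draws one answer from the model at non-zero temperature, and `step_distributions` stands in for per-token probability lists obtained from whatever logprob output the serving stack exposes.

```python
import math
from collections import Counter
from typing import Callable, Sequence

ABSTAIN = "I don't know"  # hypothetical abstention string


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and trim trailing punctuation so
    trivially different phrasings count as the same answer."""
    return " ".join(answer.lower().split()).strip(" .")


def self_consistent_answer(
    sample_fn: Callable[[str], str],  # assumed: returns one sampled answer
    prompt: str,
    n_samples: int = 5,
    min_agreement: float = 0.6,  # placeholder threshold
) -> str:
    """Sample several answers; return the majority answer only if enough
    samples agree on it, otherwise abstain."""
    answers = [normalize(sample_fn(prompt)) for _ in range(n_samples)]
    top, count = Counter(answers).most_common(1)[0]
    return top if count / n_samples >= min_agreement else ABSTAIN


def mean_token_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token distributions.
    If only top-k probabilities are available, this is an approximation
    of the entropy over the full vocabulary."""
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_distributions
    ]
    return sum(entropies) / len(entropies)


def answer_or_abstain(
    answer: str,
    step_distributions: Sequence[Sequence[float]],
    max_entropy: float = 1.0,  # placeholder threshold
) -> str:
    """Keep the answer only if the average token entropy is low enough."""
    if mean_token_entropy(step_distributions) <= max_entropy:
        return answer
    return ABSTAIN
```

Either check can gate the final response on its own. Combining them (abstain if either fires) is the more conservative choice, at the cost of more refusals on questions the model could have answered.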
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:58:37.448820+00:00— report_created — created