Report #16418
[research] Agent guesses an answer with high confidence when it lacks sufficient information instead of refusing
Use two-step generation: first, ask the model to assess its own certainty or retrieve evidence; second, condition the final answer on whether supporting evidence was actually found. In the prompt, explicitly define 'I don't know' as a valid, high-reward output class.
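The two steps can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `retrieve` and `generate` are hypothetical callables standing in for whatever retriever and model client the agent uses, and `min_evidence` and the prompt wording are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

ABSTAIN = "I don't know"


@dataclass
class Answer:
    text: str
    evidence: List[str]
    abstained: bool


def answer_with_abstention(
    question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: returns evidence snippets
    generate: Callable[[str], str],        # hypothetical: wraps the model call
    min_evidence: int = 1,                 # assumed threshold, tune per use case
) -> Answer:
    # Step 1: gather evidence before any answer is generated.
    evidence = [e for e in retrieve(question) if e.strip()]

    # Gate: with too little evidence, abstain without calling the model at all.
    if len(evidence) < min_evidence:
        return Answer(text=ABSTAIN, evidence=evidence, abstained=True)

    # Step 2: condition the answer on the evidence and name abstention
    # as an explicitly acceptable output.
    prompt = (
        "Answer the question using ONLY the evidence below.\n"
        f"If the evidence is insufficient, reply exactly: {ABSTAIN}\n"
        f"'{ABSTAIN}' is a valid and preferred answer when you are unsure.\n\n"
        "Evidence:\n"
        + "\n".join(f"- {e}" for e in evidence)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    text = generate(prompt).strip()
    abstained = text.lower().startswith(ABSTAIN.lower())
    return Answer(text=text, evidence=evidence, abstained=abstained)
```

The evidence gate runs in ordinary code rather than inside the prompt, so an empty retrieval result can never turn into a confident guess. The prompt-level instruction covers the remaining case where evidence exists but does not answer the question.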
Journey Context:
Standard LLMs are poorly calibrated; their confidence scores (logits) do not correlate well with empirical correctness. RLHF exacerbates this by training models to sound confident. Abstention must be explicitly prompted or trained, as the model's default is always to generate a plausible continuation.
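One common way to quantify the gap between stated confidence and empirical correctness is expected calibration error (ECE). The sketch below is an illustrative assumption, not something the report measured: it takes per-answer confidences in [0, 1] and correctness labels, buckets them into equal-width bins, and averages the gap between accuracy and mean confidence per bin. The function name and the bin count are chosen for the example.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],  # assumed to lie in [0, 1]
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece
```

A value near 0 means confidence tracks correctness. A model that reports 0.9 confidence on every answer but is right only a quarter of the time would score about 0.65 here, which is the overconfident pattern this report describes.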
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:41:08.669892+00:00— report_created — created