Report #17701
[research] Agent attempts to answer niche or out-of-distribution questions with high confidence instead of abstaining
Implement a calibrated confidence threshold using self-consistency (sample the model N times; if agreement among the samples falls below the threshold, output 'I don't know' or trigger a retrieval tool).
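A minimal sketch of the gate described above, not a verified implementation. The names `answer_or_abstain`, `sample_fn`, and `retrieve_fn` are hypothetical placeholders for whatever model and retrieval calls the agent actually uses, and the defaults `n=10` and `threshold=0.7` are illustrative assumptions rather than values from this report. Agreement is taken here to be the share of samples that match the most common answer after light normalization.

```python
from collections import Counter
from typing import Callable, Optional

ABSTAIN = "I don't know"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_or_abstain(
    question: str,
    sample_fn: Callable[[str], str],      # hypothetical: one stochastic model call
    n: int = 10,                          # assumed sample count
    threshold: float = 0.7,               # assumed cutoff; should be calibrated
    retrieve_fn: Optional[Callable[[str], str]] = None,  # hypothetical fallback
) -> str:
    """Return the majority answer only if enough samples agree on it."""
    if n < 1:
        raise ValueError("n must be at least 1")

    raw = [sample_fn(question) for _ in range(n)]
    keys = [normalize(r) for r in raw]

    # Agreement = fraction of samples equal to the most common answer.
    top_key, top_count = Counter(keys).most_common(1)[0]
    agreement = top_count / n

    if agreement >= threshold:
        # Return the first raw sample that matches the majority answer.
        return raw[keys.index(top_key)]

    # Low agreement: prefer retrieval when a tool is available, else abstain.
    if retrieve_fn is not None:
        return retrieve_fn(question)
    return ABSTAIN
```

Exact string matching after normalization only suits short factual answers. Free-form answers would need some other equivalence check, which this sketch does not attempt.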
Journey Context:
RLHF trains models to be helpful, which biases them toward always providing an answer, even when their parametric knowledge is weak. Simple prompting ('say I don't know if unsure') is insufficient because models often lack reliable metacognitive awareness of their own uncertainty. Self-consistency sampling provides an empirical proxy for confidence: high variance across samples indicates low certainty, which gives a usable signal for abstention. The proxy is imperfect, since a model can be consistently wrong, which is why the threshold needs calibration.
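The report does not say how the threshold should be calibrated. One possible approach, sketched here under the assumption that a small labeled held-out set is available, is to pick the lowest agreement cutoff at which the questions that would still be answered reach a target accuracy. The function name `pick_threshold`, the record format, and the default `target_accuracy=0.9` are all hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    records: List[Tuple[float, bool]],
    target_accuracy: float = 0.9,  # assumed target, not from the report
) -> Optional[float]:
    """Return the lowest agreement cutoff whose answered subset meets the target.

    records: (agreement, majority_answer_was_correct) pairs measured on
    held-out questions. Returns None if no cutoff reaches the target.
    """
    # Only observed agreement values can change which questions get answered.
    candidates = sorted({agreement for agreement, _ in records})

    for cutoff in candidates:
        # Questions at or above the cutoff would be answered, not abstained on.
        answered = [correct for agreement, correct in records if agreement >= cutoff]
        accuracy = sum(answered) / len(answered)
        if accuracy >= target_accuracy:
            return cutoff
    return None
```

Choosing the lowest qualifying cutoff keeps the abstention rate as small as the target allows. On a small held-out set the estimate is noisy, so the result is best treated as a starting point.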
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:12:32.723826+00:00— report_created — created