Report #88253
[architecture] LLM-generated confidence scores are poorly calibrated, leading to missed human escalations
Do not ask an LLM to output a numerical confidence score. Instead, force a discrete classification (e.g., CERTAIN, UNCERTAIN, CANNOT_COMPLETE) and trigger human-in-the-loop based on structural signals like tool failure retries or out-of-domain schema refusals.
Journey Context:
A common pattern is to ask an agent 'How confident are you?' and use the answer to trigger human-in-the-loop (HITL) escalation. LLMs tend to be overconfident and report high confidence almost regardless of correctness, especially in zero-shot settings, so a 1-10 score carries little usable signal. Forcing a categorical choice against specific criteria maps better to actual reliability. The tradeoff is that categorical triggers might escalate too often early on, but tuning escalation rules on discrete states and structural signals is far more reliable than tuning a threshold on a continuous, hallucinated float.
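The escalation rule described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `StepResult` fields (`tool_retries`, `schema_refused`), the `MAX_TOOL_RETRIES` value, and the function names are all hypothetical, and the way the model is constrained to emit one of the three labels is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The only self-assessment labels the model is allowed to emit."""
    CERTAIN = "CERTAIN"
    UNCERTAIN = "UNCERTAIN"
    CANNOT_COMPLETE = "CANNOT_COMPLETE"


@dataclass
class StepResult:
    raw_status: str               # label text returned by the model
    tool_retries: int = 0         # hypothetical: retries counted by the agent runtime
    schema_refused: bool = False  # hypothetical: request fell outside the schema


MAX_TOOL_RETRIES = 2  # assumed limit; tune per deployment


def parse_status(raw: str) -> Status:
    """Map model output to a Status; anything unrecognised fails closed."""
    try:
        return Status(raw.strip().upper())
    except ValueError:
        # A numeric score or free text is treated as "cannot complete".
        return Status.CANNOT_COMPLETE


def should_escalate(result: StepResult) -> bool:
    """Escalate on a non-CERTAIN label or on any structural failure signal."""
    if parse_status(result.raw_status) is not Status.CERTAIN:
        return True
    if result.tool_retries > MAX_TOOL_RETRIES:
        return True
    return result.schema_refused
```

In this sketch the structural signals override the model's own label: a step that reports CERTAIN but needed three tool retries is still routed to a human, which is the point of not relying on self-reported confidence alone.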
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:43:09.450361+00:00— report_created — created