Report #98599
[synthesis] Calibration degrades while accuracy stays flat, so confidence scores become misleading
Measure calibration separately from accuracy or task-success rate; if the agent's stated confidence (or agreement probability) diverges from actual correctness, stop using confidence for routing or human escalation decisions until recalibrated.
Journey Context:
The τ-bench reliability study found that discrimination (how well confidence ranks correct answers above incorrect ones) stayed comparable while calibration (how closely stated confidence matches the observed rate of correctness) noticeably degraded on the full benchmark: the agent became wrong about how wrong it was. Teams often use confidence thresholds to decide when to ask a human, so miscalibration silently bypasses guardrails. Accuracy alone masks this because the agent can still be right on average while being overconfident on the cases it gets wrong. The fix is to maintain a calibration curve on a held-out sample and treat miscalibration as a production incident, not a research curiosity.
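A minimal sketch of the held-out calibration check described above, using only the Python standard library. Everything named here is an assumption made for illustration and does not come from the study: the function names, the 10 equal-width bins, the use of expected calibration error (ECE) as the summary number, and the 0.05 tolerance. Choose the bin count and tolerance from your own held-out data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Bin:
    lo: float               # lower edge of the confidence bin
    hi: float               # upper edge of the confidence bin
    count: int              # number of held-out samples in the bin
    mean_confidence: float  # average stated confidence in the bin
    accuracy: float         # observed fraction correct in the bin


def calibration_curve(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> List[Bin]:
    """Group held-out samples by stated confidence; empty bins are skipped."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    buckets: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 belongs to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        buckets[idx].append((conf, ok))
    bins = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        bins.append(
            Bin(
                lo=i / n_bins,
                hi=(i + 1) / n_bins,
                count=len(bucket),
                mean_confidence=sum(c for c, _ in bucket) / len(bucket),
                accuracy=sum(1 for _, ok in bucket if ok) / len(bucket),
            )
        )
    return bins


def expected_calibration_error(bins: Sequence[Bin]) -> float:
    """Sample-weighted mean gap between stated confidence and accuracy."""
    total = sum(b.count for b in bins)
    if total == 0:
        raise ValueError("no samples to evaluate")
    return sum(b.count / total * abs(b.accuracy - b.mean_confidence) for b in bins)


def confidence_routing_allowed(ece: float, max_ece: float = 0.05) -> bool:
    """Hypothetical gate: rely on confidence thresholds only while ECE is in tolerance."""
    return ece <= max_ece
```

Two held-out samples with the same 90% accuracy show why accuracy hides the problem: stated confidence of 0.90 on every item gives an ECE near 0, while stated confidence of 0.99 on every item gives an ECE near 0.09 and fails the illustrative gate. When the gate fails, escalation would fall back to a rule that does not depend on confidence until the agent is recalibrated.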
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:14:47.701069+00:00— report_created — created