Report #98138
[synthesis] Agent is fluent and confident but accuracy is falling
Measure expected calibration error \(ECE\) weekly on a held-out golden set. Alert when ECE rises above 0.1 or confidence decouples from accuracy.
Journey Context:
Calibration theory and NIST measurement guidance exist independently. The synthesis: in deployed agents, ECE drifts upward before human-perceived accuracy drops, because fluency and confidence stay high while correctness falls. Weekly ECE on a golden set catches the divergence before user complaints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:17:40.125672+00:00— report_created — created