Report #26433
[architecture] High confidence scores masking frequent errors in agent outputs
Calibrate raw confidence scores using Platt scaling or isotonic regression on a held-out validation set before using them for routing decisions. Then set the threshold for each workflow step separately, based on the cost of false positives vs. false negatives at that step.
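A minimal sketch of the isotonic-regression option, written in plain Python so it has no dependencies. It is an illustration under stated assumptions, not a reference implementation: the function names `fit_isotonic` and `calibrate` are hypothetical, the validation data is a toy example, and the fitted mapping is a step function (library implementations typically interpolate between points). Platt scaling would instead fit a two-parameter sigmoid to the same held-out pairs.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed accuracy.

    Uses pool-adjacent-violators (PAV). `labels` are 1 if the output was
    actually correct, else 0. Returns (uppers, means): block i covers raw
    scores up to uppers[i] and maps them to calibrated probability means[i].
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("need equal-length, non-empty scores and labels")
    blocks = []  # each block is [sum_of_labels, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == score:
            # Tied raw scores must share one calibrated value.
            blocks[-1][0] += label
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, score])
        # Pool backwards while an earlier block has a higher mean than the next.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(score, uppers, means):
    """Map a raw score to the mean accuracy of the block that contains it."""
    i = bisect_left(uppers, score)
    return means[min(i, len(means) - 1)]  # scores above the last block clamp


# Toy held-out validation set (hypothetical numbers, for illustration only).
val_scores = [0.55, 0.65, 0.75, 0.85, 0.90, 0.92, 0.95, 0.97]
val_correct = [0, 0, 1, 0, 1, 0, 1, 1]

uppers, means = fit_isotonic(val_scores, val_correct)
calibrated = calibrate(0.90, uppers, means)  # a raw 0.90 maps to 0.5 here
```

On real data the validation set needs to be far larger than this, and the calibrator should be fitted on examples that were not used to train or tune the scorer.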
Journey Context:
Raw LLM log-probabilities or classifier outputs are often poorly calibrated; a 0.9 score might correspond to 60% actual accuracy. Teams often set arbitrary thresholds (e.g., >0.8) and then either miss critical errors or escalate too many benign cases to humans. Calibration fixes the mapping between scores and actual probabilities, which allows thresholds to be set rationally using expected utility theory. The pitfall is that calibration requires representative validation data; if the data distribution shifts (concept drift), calibration decays and the calibrator must be refit. Isotonic regression is more flexible than Platt scaling but requires more data and can overfit.
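Once scores are calibrated, the expected-utility argument reduces to a one-line threshold. The sketch below assumes a two-action router (auto-accept or escalate to a human) and treats "accept" as the positive decision, so a false positive is accepting a wrong output and a false negative is escalating a correct one. With calibrated probability p that the output is correct, accepting has expected cost (1 - p) * cost_fp and escalating has expected cost p * cost_fn, so accepting is cheaper when p > cost_fp / (cost_fp + cost_fn). The function names, step names, and cost figures are hypothetical.

```python
def accept_threshold(cost_fp, cost_fn):
    """Calibrated probability above which auto-accepting is the cheaper action.

    cost_fp: cost of accepting an output that is actually wrong.
    cost_fn: cost of escalating an output that was actually correct.
    """
    if cost_fp < 0 or cost_fn < 0:
        raise ValueError("costs must be non-negative")
    total = cost_fp + cost_fn
    if total == 0:
        raise ValueError("at least one cost must be positive")
    return cost_fp / total


def route(p_correct, threshold):
    """Route on a calibrated probability, not on the raw confidence score."""
    return "accept" if p_correct > threshold else "escalate"


# Hypothetical per-step costs as (cost_fp, cost_fn), in arbitrary units.
STEP_COSTS = {
    "draft_summary": (1.0, 1.0),  # a wrong summary is cheap to fix later
    "issue_refund": (9.0, 1.0),   # a wrong refund is expensive
}

thresholds = {
    step: accept_threshold(fp, fn) for step, (fp, fn) in STEP_COSTS.items()
}

# The same calibrated probability is routed differently per step.
summary_decision = route(0.8, thresholds["draft_summary"])
refund_decision = route(0.8, thresholds["issue_refund"])
```

Because the threshold depends only on the cost ratio, it does not need to change when the calibrator is refit after drift; only the mapping from raw score to probability does.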
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:46:07.177673+00:00— report_created — created