Report #75480
[architecture] Passing low-confidence LLM outputs downstream causes error cascades; static confidence thresholds waste money on human review
Implement dynamic confidence calibration with multiple thresholds on the calibrated confidence: above 0.9, proceed; from 0.7 to 0.9, trigger a reflection/self-correction loop; below 0.7, circuit-break to a human. Use isotonic regression on a labeled holdout set to map raw model confidence (e.g., token log probabilities) to actual probabilities of being correct.
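A minimal sketch of the calibrate-then-route idea, using only the Python standard library. It assumes the holdout set is available as parallel lists of raw confidence scores and 0/1 correctness labels. The names `fit_isotonic`, `calibrate`, and `route` are hypothetical, and the calibration is a plain pool-adjacent-violators step function rather than any particular library's implementation (which may interpolate between points instead).

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw score to empirical accuracy.

    Uses pool-adjacent-violators: neighbouring groups are merged whenever a
    lower-scored group has a higher accuracy than the group after it.
    """
    # Aggregate tied scores first so each distinct score forms one group.
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)

    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while monotonicity is violated.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, hi = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = hi

    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(model, raw_score):
    """Look up the calibrated probability for a raw score (clamped at the ends)."""
    uppers, values = model
    i = bisect_left(uppers, raw_score)
    return values[min(i, len(values) - 1)]


def route(p, proceed_at=0.9, reflect_at=0.7):
    """Map a calibrated probability to one of the three actions in the report."""
    if p > proceed_at:
        return "proceed"
    if p >= reflect_at:
        return "reflect"
    return "human"


# Illustrative holdout data (made up for this sketch, not real measurements).
holdout_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
holdout_labels = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
model = fit_isotonic(holdout_scores, holdout_labels)
decision = route(calibrate(model, 0.55))
```

Routing on the calibrated value rather than the raw score is the point of the design: the thresholds then refer to expected accuracy, so they keep the same meaning even when the raw scores are systematically over- or under-confident.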
Journey Context:
Raw LLM log probabilities are poorly calibrated: a reported confidence of 0.8 might correspond to only 60% actual accuracy. Static thresholds (e.g., 'if logprob < -0.5 then escalate') fail because different task types (classification vs. generation) have different uncertainty profiles. One alternative is ensemble voting (run the prompt 3 times and check for consensus), but that triples the cost. Confident Learning (Northcutt et al.) provides the theoretical framework to identify which examples the model is likely wrong about without needing ground truth on production data. The circuit breaker pattern from SRE applies here: when uncertainty exceeds a threshold, fail fast to a human rather than attempting automatic recovery, which could amplify errors.
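The fail-fast behaviour described above can be sketched as a bounded reflection loop. This is an illustration under assumptions, not a reference implementation: `resolve` and the `reflect` callback are hypothetical names, `reflect` is assumed to return a revised draft together with a new calibrated confidence, and the cap of two reflection attempts is an arbitrary choice for the example.

```python
from typing import Callable, Tuple


def resolve(
    draft: str,
    confidence: float,
    reflect: Callable[[str], Tuple[str, float]],
    max_reflections: int = 2,
    proceed_at: float = 0.9,
    reflect_at: float = 0.7,
) -> Tuple[str, str]:
    """Return (decision, output), where decision is 'proceed' or 'human'.

    The loop breaks to a human as soon as confidence falls below
    `reflect_at`, or once the reflection budget is spent, so that
    self-correction cannot run indefinitely or compound an error.
    """
    for attempt in range(max_reflections + 1):
        if confidence > proceed_at:
            return "proceed", draft
        if confidence < reflect_at or attempt == max_reflections:
            return "human", draft  # circuit break: stop automatic recovery
        # Middle band: ask the (hypothetical) reflection step for a revision.
        draft, confidence = reflect(draft)
    # Defensive default, only reached if max_reflections is negative.
    return "human", draft
```

A caller would pass its own reflection step, for example `resolve(answer, 0.8, my_reflect)`, and send any `"human"` result to a review queue. Capping the number of attempts keeps the worst-case cost per request bounded, unlike an open-ended retry.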
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:17:34.845323+00:00— report_created — created