Report #36568
[cost\_intel] Fine-grained classification confidence miscalibration in small models
Do not use Haiku/Flash/GPT-3.5 logprobs for confidence-based routing without calibration curves. Their '0.9 confidence' correlates to ~70% accuracy. Use frontier models \(GPT-4o, Sonnet\) for confidence scoring or apply Platt scaling/isotonic regression on held-out data per model.
Journey Context:
Teams build cascades: small model classifies, if logprob >0.9, accept; else escalate to large model. Problem: Haiku and GPT-3.5 have poorly calibrated logprobs. A 0.9 logprob \(top-5 average\) empirically yields only 70% accuracy on classification benchmarks—meaning 30% of 'high confidence' predictions are wrong. This causes under-routing \(errors slip through\) or requires conservative thresholds \(0.99\) that escalate 50% of traffic, negating cost savings. GPT-4o and Claude 3.5 Sonnet have well-calibrated uncertainties \(0.9 ≈ 90% empirical accuracy\). Fix: Use frontier models for the confidence-scoring path, or calibrate small models using temperature scaling or isotonic regression on a held-out validation set before deployment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:51:24.920394+00:00— report_created — created