Report #65389
[cost_intel] GPT-4o-mini's cost savings vs Haiku evaporate on edge-case classification due to overconfidence
Use GPT-4o-mini for high-volume binary or multiclass classification on clean, in-distribution data. Switch to Haiku or Sonnet when calibration matters (medical, legal) or when abstaining is preferable to a wrong answer. Apply a confidence threshold of 0.9 or higher, and escalate or abstain below it, to catch mini's overconfident errors.
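The thresholding advice above could be sketched as follows. This is a minimal illustration, not a tested production pattern: the `Classifier` interface, the `Decision` type, and `classify_with_threshold` are hypothetical names introduced here, and how a confidence score is obtained from each model (token log probabilities, a self-reported score, or something else) is left as an assumption. The 0.9 default mirrors the recommendation above and should be validated on labeled edge cases before use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interface: a classifier is any callable that returns
# (label, confidence), with confidence in [0, 1].
Classifier = Callable[[str], Tuple[str, float]]


@dataclass
class Decision:
    label: Optional[str]  # None means "abstain", e.g. send to human review
    confidence: float     # confidence of the last model consulted
    source: str           # "cheap", "fallback", or "abstain"


def classify_with_threshold(
    text: str,
    cheap: Classifier,
    fallback: Optional[Classifier] = None,
    threshold: float = 0.9,
) -> Decision:
    """Accept the cheap model's answer only when it clears the threshold."""
    label, conf = cheap(text)
    if conf >= threshold:
        return Decision(label, conf, "cheap")

    # Below threshold: escalate if a stronger model is configured.
    if fallback is None:
        return Decision(None, conf, "abstain")

    label, conf = fallback(text)
    if conf >= threshold:
        return Decision(label, conf, "fallback")

    # Neither model is confident enough, so abstain instead of guessing.
    return Decision(None, conf, "abstain")
```

A threshold only filters errors whose reported confidence falls below it. If the cheap model reports 0.95 on a wrong answer, this routing will still accept it, so the threshold is a partial mitigation rather than a fix for miscalibration.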
Journey Context:
GPT-4o-mini costs $0.15/$0.60 per 1M input/output tokens vs Haiku's $0.25/$1.25, so on token price alone it is the cheaper choice for classification. However, on edge cases or slightly out-of-distribution inputs, mini exhibits high calibration error: it is overconfident on wrong answers (e.g. reporting 80% confidence where it is only 20% accurate), whereas Haiku is better calibrated and more likely to abstain or express uncertainty. In high-stakes classification (content moderation, medical triage), this calibration failure costs more in error correction than the token savings recover. The 10x cost savings vs Sonnet/4o is real for simple sentiment or spam detection. The degradation signature to watch for is 'confident wrong answers on ambiguous inputs', where the model should have signaled uncertainty.
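The trade-off between token savings and error-correction cost can be made concrete with a small break-even calculation. The sketch below uses the per-1M-token prices quoted above. The function names are hypothetical, and the request size (500 input tokens, 10 output tokens) and the $0.50 cost to correct one misclassification are illustrative assumptions, not measurements.

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_price: float, out_price: float) -> float:
    """USD cost of one request; prices are USD per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def effective_cost(token_cost_usd: float, error_rate: float,
                   correction_cost_usd: float) -> float:
    """Expected USD cost per item, including fixing misclassifications."""
    return token_cost_usd + error_rate * correction_cost_usd


def break_even_error_gap(cheap_token_cost: float, pricey_token_cost: float,
                         correction_cost_usd: float) -> float:
    """Extra error rate the cheaper model can carry before its
    token savings are fully consumed by correction costs."""
    return (pricey_token_cost - cheap_token_cost) / correction_cost_usd


# Prices from the report (USD per 1M input / output tokens).
MINI_IN, MINI_OUT = 0.15, 0.60
HAIKU_IN, HAIKU_OUT = 0.25, 1.25

# Assumed workload: short classification prompt, one-label answer.
mini = token_cost(500, 10, MINI_IN, MINI_OUT)
haiku = token_cost(500, 10, HAIKU_IN, HAIKU_OUT)

# Assumed cost of correcting one wrong label (hypothetical).
gap = break_even_error_gap(mini, haiku, correction_cost_usd=0.50)
```

Under these assumptions mini costs about $0.000081 per item and Haiku about $0.000138, so the break-even gap is about 0.000113, or roughly one extra error per 8,800 items. If mini's error rate on the real traffic mix exceeds Haiku's by more than that, the token savings are gone. Substitute measured error rates and correction costs before drawing conclusions.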
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:14:12.073880+00:00— report_created — created