Report #90444
[cost_intel] Haiku/Flash fails silently on classification tasks with >20 classes or imbalanced base rates
Use Sonnet/Pro when the class count is >20 or the minority-class frequency is <5%; when using Haiku, implement confidence thresholding (e.g., flag predictions with logprob < -0.5) to catch its overconfident misclassifications on tail classes.
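A minimal sketch of the tier-selection rule above, assuming a labeled sample of the data is available. The function name `choose_tier` and the tier labels `"small"` and `"large"` are hypothetical, not part of any vendor API; only the two cutoffs (20 classes, 5% minority frequency) come from this report.

```python
from collections import Counter
from typing import Sequence

# Cutoffs taken from the report's recommendation.
MAX_CLASSES_FOR_SMALL_MODEL = 20
MIN_MINORITY_FREQUENCY = 0.05


def choose_tier(labels: Sequence[str]) -> str:
    """Return "large" (Sonnet/Pro class) or "small" (Haiku/Flash class).

    `labels` is a representative sample of ground-truth labels.
    The tier names are illustrative placeholders.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # Frequency of the rarest class that appears in the sample.
    minority_frequency = min(counts.values()) / len(labels)
    too_many_classes = len(counts) > MAX_CLASSES_FOR_SMALL_MODEL
    too_imbalanced = minority_frequency < MIN_MINORITY_FREQUENCY
    return "large" if (too_many_classes or too_imbalanced) else "small"
```

Classes that never appear in the sample are not counted, so a small sample may understate both the class count and the imbalance.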
Journey Context:
Benchmarks on multi-label classification show Haiku achieves 94% accuracy on the top-5 classes but drops to 67% on classes with <100 training examples, while Sonnet maintains 89%. Error mode: when uncertain, Haiku assigns high probability to common classes (calibration error 0.25 vs. 0.08 for Sonnet). Cost tradeoff: Haiku + confidence filtering + human escalation for low-confidence items yields 40% cost savings vs. running Sonnet on all items, with <2% quality degradation. Without filtering, Haiku produces silent quality cliffs.
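The filter-and-escalate setup described above might look roughly like the following sketch. It assumes each prediction already carries the log-probability of its predicted label; how that value is obtained depends on the provider and is not shown. The `Prediction` type and `route` function are hypothetical names, and -0.5 is only the example threshold from this report, not a tuned value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Example threshold from the report; tune it on held-out data.
LOGPROB_THRESHOLD = -0.5


@dataclass
class Prediction:
    item_id: str
    label: str
    logprob: float  # log-probability of the predicted label


def route(
    predictions: Iterable[Prediction],
    threshold: float = LOGPROB_THRESHOLD,
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (accepted, escalated).

    Items whose logprob is below the threshold are escalated
    (for example to human review); the rest are accepted as-is.
    """
    accepted: List[Prediction] = []
    escalated: List[Prediction] = []
    for p in predictions:
        if p.logprob < threshold:
            escalated.append(p)
        else:
            accepted.append(p)
    return accepted, escalated
```

A logprob of -0.5 corresponds to a probability of about 0.61. Because the report describes Haiku as overconfident on tail classes, a fixed threshold may still let some errors through, so it is worth checking the escalation rate per class rather than only in aggregate.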
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:24:21.204019+00:00— report_created — created