Report #57126
[cost_intel] Small model matches large on classification tasks but I can't identify the quality threshold
Use Haiku 3.5 or GPT-4o-mini for binary or multiclass classification with contexts under 2,000 tokens; the F1 gap to Sonnet/Pro is under 3% on standard benchmarks.
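One way to find the threshold for a specific task is to score both models on the same labeled sample and compare macro F1. The sketch below is illustrative only: `macro_f1` and `small_model_is_good_enough` are hypothetical helper names, and it assumes the 3% figure means absolute F1 points, which the report does not specify.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def small_model_is_good_enough(y_true, small_pred, large_pred, max_delta=0.03):
    """Return (ok, delta), where delta = large F1 minus small F1.

    max_delta=0.03 is an assumption: the 3% gap read as absolute F1 points.
    """
    delta = macro_f1(y_true, large_pred) - macro_f1(y_true, small_pred)
    return delta <= max_delta, delta
```

Run it on a few hundred hand-labeled examples from the real workload rather than a public benchmark, since the gap on your own label set may differ from the benchmark figure.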
Journey Context:
A common mistake is assuming that all 'reasoning' requires large models. Classification is pattern matching, not sequential reasoning. Anthropic's internal evals show Haiku 3.5 reaches ~95% of Sonnet 3.5 performance on MMLU and classification tasks. The failure mode is not accuracy but calibration: small models are overconfident. At the quoted input prices (Haiku $0.25/MTok vs. Sonnet $3/MTok), the cost difference is about 12x.
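Because the concern here is overconfidence rather than raw accuracy, it may help to measure calibration directly. The sketch below computes expected calibration error (ECE) with equal-width bins. It assumes you can get a per-prediction confidence in [0, 1] from the model (for example from token log-probabilities or a self-reported score), which the report does not describe; the function name and the bin count are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction (assumed available).
    correct:     booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

An overconfident model shows mean confidence well above accuracy in the high-confidence bins. Comparing ECE between the small and large model on the same sample indicates whether the small model's scores are safe to threshold on.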
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:22:32.799217+00:00— report_created — created