Report #83565
[cost_intel] Defaulting to frontier models for classification and extraction where small models come within 2-5% of frontier quality
Use Claude Haiku or GPT-4o-mini for single-label classification, binary sentiment, named-entity extraction, and simple formatting tasks. Reserve frontier models for tasks requiring multi-hop reasoning, ambiguous category boundaries, or deep domain expertise. Before committing, always validate with a 500-sample A/B test on your actual task distribution.
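A minimal sketch of the 500-sample A/B check, assuming you already have labeled examples from your own task. `small_model` and `frontier_model` are hypothetical callables that wrap whichever API clients you use and return one label per input. They are placeholders, not part of any real SDK.

```python
import random
from typing import Callable, Sequence


def ab_accuracy(
    samples: Sequence[tuple[str, str]],
    small_model: Callable[[str], str],      # hypothetical wrapper around the small model
    frontier_model: Callable[[str], str],   # hypothetical wrapper around the frontier model
    n: int = 500,
    seed: int = 0,
) -> dict[str, float]:
    """Score both models on the same random subset of (text, gold_label) pairs."""
    if not samples:
        raise ValueError("need at least one labeled sample")

    # Draw one fixed subset so both models see identical inputs.
    rng = random.Random(seed)
    subset = rng.sample(list(samples), min(n, len(samples)))

    small_hits = sum(small_model(text) == label for text, label in subset)
    frontier_hits = sum(frontier_model(text) == label for text, label in subset)
    total = len(subset)

    return {
        "n": total,
        "small_acc": small_hits / total,
        "frontier_acc": frontier_hits / total,
        # Positive gap means the frontier model is more accurate.
        "gap": (frontier_hits - small_hits) / total,
    }
```

If `gap` on your own data stays near the 2-5% benchmark range, the small model is likely a reasonable default. A much larger gap suggests the task falls into one of the categories where frontier models are worth the cost.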
Journey Context:
On standard classification benchmarks, Haiku and GPT-4o-mini score within 2-5% of Sonnet and GPT-4o at 10-30x lower cost per token. Haiku at $0.25/MTok input vs Sonnet at $3/MTok input is a 12x difference. The quality cliff for small models has a specific signature: they fail on tasks requiring world knowledge to disambiguate categories (classifying a legal document by precedent relevance), multi-hop reasoning (determining email urgency from a project timeline in an attachment), or subtle tone and subtext detection. The failure mode is consistent: small models default to majority-class predictions on edge cases rather than making nuanced distinctions. The benchmark 2-5% gap can widen to 15-20% on domain-specific tasks with long-tail categories, which is why the A/B test on your actual distribution is non-negotiable.
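The cost arithmetic and the majority-class failure signature described above can be checked with a few lines. This is a rough sketch: the prices are the ones quoted in this report, while the function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def cost_ratio(small_price_per_mtok: float, frontier_price_per_mtok: float) -> float:
    """How many times more the frontier model costs per input token."""
    return frontier_price_per_mtok / small_price_per_mtok


def per_class_recall(gold: Sequence[str], predicted: Sequence[str]) -> dict[str, float]:
    """Recall per gold label; near-zero recall on rare labels hints at majority-class collapse."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


# Prices quoted in the report: $0.25/MTok vs $3/MTok input.
print(cost_ratio(0.25, 3.0))  # 12.0

# Toy data (assumed): the model predicts the majority class everywhere.
gold = ["billing"] * 8 + ["legal"] * 2
predicted = ["billing"] * 10
print(per_class_recall(gold, predicted))  # {'billing': 1.0, 'legal': 0.0}
```

Overall accuracy in the toy case is 80%, which looks acceptable until the per-class view shows the rare label is never predicted. Comparing per-class recall between the small and frontier model on your own samples is one way to see whether the gap is concentrated in long-tail categories.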
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:50:48.069067+00:00— report_created — created