Report #57902
[cost_intel] Over-provisioning frontier models for straightforward text classification tasks
Use Haiku/Flash/GPT-4o-mini for standard classification tasks (sentiment, topic, intent, spam, PII detection). These models typically come within 2-5% of frontier-model accuracy at a much lower cost: often 10-20x, though the gap varies by model pair (about 3.75x for the Claude 3.5 prices quoted under Journey Context). Reserve frontier models for classification that requires deep domain expertise, nuanced contextual understanding, or resolving genuinely ambiguous edge cases.
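The recommendation above can be sketched as a small routing rule. This is a minimal illustration, not a tested policy: the `ClassificationJob` fields, the `choose_tier` function, and the "small"/"frontier" tier labels are all hypothetical names introduced here, and the choice to send unrecognized task types to the frontier tier is an assumption.

```python
from dataclasses import dataclass

# Task types the report lists as standard classification.
STANDARD_TASKS = {"sentiment", "topic", "intent", "spam", "pii_detection"}


@dataclass(frozen=True)
class ClassificationJob:
    """Hypothetical description of one classification workload."""
    task: str
    needs_domain_expertise: bool = False
    needs_contextual_nuance: bool = False
    ambiguous_edge_case: bool = False


def choose_tier(job: ClassificationJob) -> str:
    """Return "small" or "frontier" (hypothetical tier labels)."""
    escalate = (
        job.needs_domain_expertise
        or job.needs_contextual_nuance
        or job.ambiguous_edge_case
    )
    # Assumption: anything outside the standard list goes to the
    # frontier tier until someone has evaluated it on a small model.
    if job.task in STANDARD_TASKS and not escalate:
        return "small"
    return "frontier"


print(choose_tier(ClassificationJob("spam")))
print(choose_tier(ClassificationJob("spam", ambiguous_edge_case=True)))
```

In practice the escalation flags would have to come from somewhere (a task registry, a confidence score, a human label); the sketch only shows the decision shape.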
Journey Context:
Classification is the sweet spot for small models: the output space is bounded, the reasoning is shallow, and the task is well-defined. Benchmarks consistently show small models at 95%+ of frontier performance. The cost difference is substantial: Claude 3.5 Haiku at $0.80/1M input tokens vs Claude 3.5 Sonnet at $3/1M (about 3.75x), and Gemini 1.5 Flash at $0.075/1M input tokens vs Gemini 1.5 Pro at $1.25/1M (about 17x). The hidden failure mode is a task labeled 'classification' that actually requires multi-step reasoning. 'Classify this email as urgent' might require understanding project dependencies, deadlines, and organizational context; that is not really classification, it is reasoning with a classification output format.
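To make the price gap concrete, the sketch below works through the arithmetic using only the input prices quoted above. The workload size (1,000,000 items at 300 input tokens each) is a hypothetical figure chosen for illustration, the model keys are just labels, and output-token costs are ignored.

```python
# Input prices in USD per 1M tokens, copied from the figures quoted above.
INPUT_PRICE_PER_MTOK = {
    "claude-3.5-haiku": 0.80,
    "claude-3.5-sonnet": 3.00,
    "gemini-1.5-flash": 0.075,
    "gemini-1.5-pro": 1.25,
}


def input_cost_usd(model: str, n_items: int, tokens_per_item: int) -> float:
    """Input-token cost only; output tokens are not included."""
    total_tokens = n_items * tokens_per_item
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK[model]


def price_ratio(frontier: str, small: str) -> float:
    """How many times more the frontier model costs per input token."""
    return INPUT_PRICE_PER_MTOK[frontier] / INPUT_PRICE_PER_MTOK[small]


# Hypothetical workload: 1M items, 300 input tokens each.
N_ITEMS, TOKENS_PER_ITEM = 1_000_000, 300

for model in INPUT_PRICE_PER_MTOK:
    cost = input_cost_usd(model, N_ITEMS, TOKENS_PER_ITEM)
    print(f"{model}: ${cost:,.2f}")

print(f"Sonnet/Haiku ratio: {price_ratio('claude-3.5-sonnet', 'claude-3.5-haiku'):.2f}x")
print(f"Pro/Flash ratio: {price_ratio('gemini-1.5-pro', 'gemini-1.5-flash'):.2f}x")
```

Under these assumptions the Claude pair differs by 3.75x and the Gemini pair by roughly 16.7x, so the size of the saving depends heavily on which pair is being compared.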
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:40:52.476939+00:00— report_created — created