Report #31126
[cost\_intel] Using frontier models for classification, extraction, and formatting tasks where small models match within 2-5%
Route these task types to Haiku/Flash/GPT-4o-mini by default: named entity extraction, sentiment classification, JSON schema formatting, code syntax translation, summarization of single documents, boolean guard checks, keyword extraction. Escalate to a frontier model only when you observe a quality gap of more than 5% on your own evaluation set.
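A minimal sketch of the default routing rule, assuming each request is tagged with a task type. The task-type strings, the `Tier` enum, and `choose_tier` are hypothetical names for illustration. Reading the 5% threshold as an absolute difference in evaluation score, and sending unlisted task types to the frontier tier, are both assumptions rather than part of the report.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    SMALL = "small"        # e.g. Haiku / Flash / GPT-4o-mini
    FRONTIER = "frontier"  # larger, more expensive models


# Task types the report routes to a small model by default.
# The string labels are hypothetical; use whatever tags your pipeline has.
SMALL_MODEL_TASKS = frozenset({
    "entity_extraction",
    "sentiment_classification",
    "json_formatting",
    "code_syntax_translation",
    "single_doc_summarization",
    "boolean_guard",
    "keyword_extraction",
})

# Assumption: the 5% gap is an absolute difference in eval score (0.05).
ESCALATION_GAP = 0.05


def choose_tier(task_type: str, measured_gap: Optional[float] = None) -> Tier:
    """Pick a model tier for one request.

    measured_gap is (frontier score - small score) on your own evaluation
    set, or None if the task has not been measured yet.
    """
    if task_type not in SMALL_MODEL_TASKS:
        # Assumption: unlisted task types stay on the frontier tier.
        return Tier.FRONTIER
    if measured_gap is not None and measured_gap > ESCALATION_GAP:
        return Tier.FRONTIER
    return Tier.SMALL
```

For example, `choose_tier("keyword_extraction")` stays on the small tier, while `choose_tier("keyword_extraction", measured_gap=0.08)` escalates because the measured gap exceeds the threshold.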
Journey Context:
The quality gap between model tiers is not uniform; it is heavily task-dependent. For narrow, well-defined tasks with clear correct answers, small models perform well: Haiku matches Sonnet within 2-5% on extraction and classification benchmarks. The gap widens to 15-30% for tasks requiring multi-step reasoning, creative synthesis, or nuanced judgment. The mistake is using frontier models as a default 'just to be safe,' which can raise cost per request by 10-20x. The right approach is empirical: run your specific task on both model tiers with a representative test set, measure the quality delta, and route accordingly. For most pipelines, 70-80% of requests can be handled by small models with no perceptible quality loss.
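The empirical check described above can be sketched as follows, assuming you have already collected predictions from both tiers on a labeled test set. The function names are hypothetical. Exact-match accuracy is an assumed metric that suits classification and extraction with clear correct answers; swap in your own scorer for other tasks.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def quality_gap(
    small_preds: Sequence[str],
    frontier_preds: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Frontier accuracy minus small-model accuracy on the same test set."""
    return accuracy(frontier_preds, labels) - accuracy(small_preds, labels)


def blended_cost_ratio(small_share: float, cost_multiple: float) -> float:
    """Cost of a routed pipeline relative to sending everything to frontier.

    small_share:   fraction of requests handled by the small model (0..1)
    cost_multiple: how many times more a frontier request costs than a
                   small-model request (the report cites 10-20x)
    """
    if not 0.0 <= small_share <= 1.0 or cost_multiple <= 0:
        raise ValueError("small_share must be in [0, 1] and cost_multiple > 0")
    return small_share / cost_multiple + (1.0 - small_share)
```

As a worked example using figures from the ranges quoted above: if 75% of requests move to a small model that is 10x cheaper, `blended_cost_ratio(0.75, 10)` is 0.075 + 0.25, or about 0.325 of the all-frontier cost. The remaining frontier traffic dominates the bill, so the share of requests you can route matters more than the exact price multiple.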
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:38:03.219341+00:00— report_created — created