Report #79778
[cost_intel] Using few-shot GPT-4o for high-volume binary classification without fine-tuning
Fine-tune GPT-3.5-turbo or GPT-4o-mini for binary classification when you have >10k labeled examples and >100k daily inferences. This beats GPT-4o few-shot accuracy and reduces costs 5-10x, but it requires keeping class imbalance in the training set no worse than about 20:1 to prevent overfitting.
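The cost side of this claim can be sanity-checked with simple arithmetic before committing to a tuning run. The sketch below is illustrative only: the `Pricing` type, the helper names, and every number in the usage section are hypothetical placeholders, not real provider prices or figures from this report. Substitute current prices and your own token counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens. Values passed in below are placeholders, not real prices."""
    input_per_1m: float
    output_per_1m: float


def daily_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
               pricing: Pricing) -> float:
    """Daily inference spend for a fixed per-request token budget."""
    per_request = (input_tokens * pricing.input_per_1m
                   + output_tokens * pricing.output_per_1m) / 1_000_000
    return requests_per_day * per_request


def break_even_days(one_time_cost: float, baseline_daily: float,
                    tuned_daily: float) -> Optional[float]:
    """Days until the one-time tuning cost is recovered; None if there is no saving."""
    saving = baseline_daily - tuned_daily
    if saving <= 0:
        return None
    return one_time_cost / saving


if __name__ == "__main__":
    # All figures below are assumptions for illustration.
    large_few_shot = Pricing(input_per_1m=10.0, output_per_1m=30.0)
    small_tuned = Pricing(input_per_1m=1.0, output_per_1m=2.0)

    # Few-shot prompts carry in-context examples, so they use more input tokens.
    baseline = daily_cost(100_000, input_tokens=1_500, output_tokens=2,
                          pricing=large_few_shot)
    tuned = daily_cost(100_000, input_tokens=300, output_tokens=2,
                       pricing=small_tuned)

    print(f"baseline/day: {baseline:.2f}  tuned/day: {tuned:.2f}")
    print("break-even days:", break_even_days(500.0, baseline, tuned))
```

Note that part of the saving comes from the shorter prompt, not only from the cheaper model: a fine-tuned model no longer needs in-context examples on every request.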
Journey Context:
Engineers often assume that a frontier model with few-shot prompting outperforms a fine-tuned small model. In practice, with enough labeled data (>10k examples), a fine-tuned small model can surpass large-model few-shot accuracy while costing less per request. The break-even point is around 100k requests/day, where the one-time tuning cost amortizes. The main risk is overfitting on imbalanced data, which calls for stratified sampling when building the training and evaluation sets. An alternative is few-shot prompting with RAG-retrieved examples, but latency is higher. This recommendation holds for stable classification tasks with historical data.
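A minimal sketch of the stratified sampling step, using only the standard library. The function names and the `max_ratio=20.0` threshold are assumptions for illustration; the threshold reads the report's 20:1 figure as an upper bound on majority-to-minority imbalance, which is an interpretation rather than something the report spells out.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


def stratified_split(
    examples: Sequence[str],
    labels: Sequence[Hashable],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[tuple], List[tuple]]:
    """Split into (train, holdout) so each class keeps roughly the same share in both.

    Each class contributes at least one holdout item, so a class with a
    single example ends up with no training items.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, holdout = [], []
    for label, idxs in by_label.items():
        rng.shuffle(idxs)
        n_holdout = max(1, round(len(idxs) * holdout_fraction))
        holdout.extend((examples[i], label) for i in idxs[:n_holdout])
        train.extend((examples[i], label) for i in idxs[n_holdout:])
    rng.shuffle(train)
    rng.shuffle(holdout)
    return train, holdout


def check_balance(labels: Sequence[Hashable], max_ratio: float = 20.0) -> bool:
    """True if imbalance is within max_ratio (assumed threshold, see lead-in)."""
    return imbalance_ratio(labels) <= max_ratio
```

If scikit-learn is available, `train_test_split(..., stratify=labels)` does the same job; the hand-rolled version is shown only to keep the example dependency-free.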
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:30:33.653015+00:00— report_created — created