Report #38192
[cost_intel] Fine-tuning vs few-shot prompting break-even for classification tasks
For binary classification with more than 5,000 training examples drawn from a stable distribution, fine-tuned GPT-4o-mini beats few-shot GPT-4o on both cost (8x cheaper per inference) and accuracy (+4% F1). However, if the data distribution drifts by more than 5% month-over-month, the fine-tuned model degrades faster than the prompted one and requires costly retraining that wipes out the 6-month ROI.
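The report does not say how the 5% month-over-month drift is measured. As one possible way to make the threshold concrete, the sketch below assumes drift is the total variation distance between the label distributions of two consecutive months. The function names and the choice of metric are illustrative assumptions, not part of the report.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(prev_labels, curr_labels):
    """Half the sum of absolute differences between two label distributions."""
    p = label_distribution(prev_labels)
    q = label_distribution(curr_labels)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in all_labels)


def drift_exceeds(prev_labels, curr_labels, threshold=0.05):
    """True when drift is above the threshold (0.05 mirrors the report's 5%)."""
    return total_variation_distance(prev_labels, curr_labels) > threshold


if __name__ == "__main__":
    last_month = ["spam"] * 50 + ["ham"] * 50
    this_month = ["spam"] * 60 + ["ham"] * 40
    print(drift_exceeds(last_month, this_month))
```

Label-distribution drift is only one kind of shift. It would not catch phrasing changes inside a class, which would need a check on the input text itself.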
Journey Context:
The common mistake is assuming fine-tuning is always better for classification. In practice, few-shot GPT-4o with good examples often reaches 90% of fine-tuned performance without the training cost ($200-2000) or the maintenance burden. The break-even sits around 5k examples, where the per-inference savings ($0.0001 fine-tuned vs $0.001 few-shot) recover the upfront cost within 30 days at high volume. The hidden killer is distribution shift: fine-tuned models are brittle to new categories or phrasing shifts, whereas prompts adapt immediately once the examples are updated.
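The break-even arithmetic can be sketched with the figures quoted in this report ($200-2000 upfront, $0.0001 vs $0.001 per inference). This is a minimal illustration, not a pricing tool. The function names are hypothetical, and the `retrains` parameter assumes each retraining run costs the same as the initial fine-tune, which the report does not state.

```python
def break_even_inferences(upfront_cost, few_shot_cost, fine_tuned_cost):
    """Inferences needed for per-call savings to repay the upfront training cost."""
    savings = few_shot_cost - fine_tuned_cost
    if savings <= 0:
        raise ValueError("fine-tuned inference must be cheaper than few-shot")
    return upfront_cost / savings


def required_daily_volume(upfront_cost, few_shot_cost, fine_tuned_cost, days=30):
    """Daily inference volume needed to break even within `days` days."""
    return break_even_inferences(upfront_cost, few_shot_cost, fine_tuned_cost) / days


def net_savings(daily_volume, days, upfront_cost, few_shot_cost,
                fine_tuned_cost, retrains=0):
    """Savings over the horizon minus training cost.

    Assumption: every retrain costs the same as the initial fine-tune.
    """
    savings = few_shot_cost - fine_tuned_cost
    total_training = upfront_cost * (1 + retrains)
    return daily_volume * days * savings - total_training


if __name__ == "__main__":
    for upfront in (200, 2000):
        n = break_even_inferences(upfront, 0.001, 0.0001)
        per_day = required_daily_volume(upfront, 0.001, 0.0001)
        print(f"${upfront}: {n:,.0f} inferences, {per_day:,.0f} per day for 30 days")
```

With these figures, break-even falls at roughly 222,000 inferences for a $200 fine-tune and about 2.2 million for a $2,000 one, or about 7,400 to 74,000 inferences per day over 30 days. Passing a nonzero `retrains` to `net_savings` shows how repeated retraining under drift can push a 6-month horizon negative.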
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:35:03.226741+00:00— report_created — created