Report #50021
[cost\_intel] At what dataset size does fine-tuning beat few-shot prompting for classification cost-quality?
Fine-tune once you have roughly 1000\+ labeled examples; below that, few-shot prompting with GPT-4o-mini is cheaper and about as accurate. At 10k\+ examples, a fine-tuned Haiku beats few-shot GPT-4o at about 1/10th the cost.
Journey Context:
Teams often default to fine-tuning for any custom classification task, but the fixed training cost \($5-20 per job\) only amortizes at scale, and inference on the fine-tuned model is still billed per token. With around 100 examples, few-shot prompting is the better choice on both cost and accuracy. The crossover falls around 500-1000 examples, depending on class complexity. A common anti-pattern is fine-tuning on 200 examples and getting worse performance than 5-shot prompting with in-context examples selected by semantic-embedding similarity. Fine-tuned small models \(Haiku, GPT-3.5 Turbo\) trained on 10k\+ examples reach about 95% of GPT-4 accuracy at roughly 5% of the cost, but only if the training data is clean and representative.
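The amortization argument above can be made concrete with a rough break-even calculation. The following is a minimal sketch, not a pricing tool: the `CostModel` fields and function names are hypothetical, all prices are placeholders you would replace with your provider's current rates, and output tokens are ignored on the assumption that a classification label is only a few tokens long.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostModel:
    """Hypothetical inputs; prices are USD per 1M input tokens (placeholders)."""
    few_shot_price_per_mtok: float    # model used for few-shot prompting
    fine_tuned_price_per_mtok: float  # inference price on the fine-tuned model
    training_cost: float              # one-off fine-tuning job cost (report cites $5-20)
    input_tokens: int                 # tokens in the text being classified
    tokens_per_example: int           # tokens per in-context example
    shots: int                        # number of in-context examples per request


def few_shot_cost(m: CostModel, requests: int) -> float:
    # Every request re-sends the in-context examples along with the input.
    tokens = m.input_tokens + m.shots * m.tokens_per_example
    return requests * tokens * m.few_shot_price_per_mtok / 1_000_000


def fine_tuned_cost(m: CostModel, requests: int) -> float:
    # Fixed training cost plus per-request inference on the bare input.
    per_request = m.input_tokens * m.fine_tuned_price_per_mtok / 1_000_000
    return m.training_cost + requests * per_request


def break_even_requests(m: CostModel) -> Optional[int]:
    """Smallest request count at which fine-tuning is no more expensive.

    Returns None when fine-tuned inference is not cheaper per request,
    in which case the training cost never pays for itself.
    """
    saving = few_shot_cost(m, 1) - (fine_tuned_cost(m, 1) - m.training_cost)
    if saving <= 0:
        return None
    return math.ceil(m.training_cost / saving)


# Example with made-up prices: a 5-shot prompt versus a fine-tuned model
# whose per-token price is twice as high but whose prompts are much shorter.
example = CostModel(
    few_shot_price_per_mtok=0.15,
    fine_tuned_price_per_mtok=0.30,
    training_cost=10.0,
    input_tokens=200,
    tokens_per_example=100,
    shots=5,
)
print(break_even_requests(example))
```

This models cost only. It says nothing about accuracy, which per the report depends on the number and quality of labeled examples, so the result is best read as a lower bound on the request volume needed before fine-tuning is worth evaluating.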
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:26:37.061949+00:00— report_created — created