Report #72320
[cost\_intel] At what training set size does fine-tuning GPT-4o Mini beat few-shot prompting with GPT-4o on structured extraction tasks?
Fine-tune a smaller model only when you have both 500-1000\+ high-quality labeled examples and more than 10,000 inference calls per day; if either condition is missing, few-shot prompting with a frontier model \(GPT-4o/Claude 3.5 Sonnet\) delivers better accuracy at a lower total cost of ownership.
Journey Context:
Teams often default to fine-tuning GPT-4o Mini \($0.60 per million input tokens\) on the assumption that it beats GPT-4o \($5.00 per million input tokens\) on cost. Fine-tuning, however, carries fixed costs that the per-token price hides: training jobs \($20-200\), data-labeling labor, and the opportunity cost of a less capable base model that requires strict prompt formatting. Few-shot prompting with a frontier model relies on in-context learning and needs only 3-5 examples to reach 90%\+ accuracy on many extraction tasks. On total cost of ownership \(training \+ inference\), fine-tuning breaks even only at high volume: roughly 10,000\+ requests/day sustained over months, with 500-1000\+ training examples needed to reach quality parity. Below that, the smaller fine-tuned model's accuracy loss \(a 5-15% F1 drop\) plus the training overhead make few-shot prompting with a frontier model the more economical choice. Fine-tuned models are also more exposed to distribution shift: they tend to fail on input variations absent from their 500-1000 training examples, whereas few-shot frontier models generalize better from limited examples.
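The break-even reasoning above can be sketched as a small calculation. This is a rough model, not a pricing tool: the two per-token prices and the $20-200 training-job range come from this report, while the tokens per request, the labeling cost per example, and the `Scenario` and `break_even_days` names are hypothetical placeholders to replace with your own figures. It counts input tokens only and ignores output tokens, engineering time, and the accuracy gap between the two models.

```python
from dataclasses import dataclass
from typing import Optional

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Scenario:
    requests_per_day: int
    few_shot_input_tokens: int        # prompt including in-context examples (assumed)
    fine_tuned_input_tokens: int      # prompt without examples (assumed)
    training_examples: int
    labeling_cost_per_example: float  # USD per example (assumed)
    training_job_cost: float          # USD; the report quotes a 20-200 range
    frontier_price: float = 5.00      # USD per 1M input tokens, as quoted in the report
    fine_tuned_price: float = 0.60    # USD per 1M input tokens, as quoted in the report


def daily_inference_cost(requests: int, tokens: int, price_per_million: float) -> float:
    """Input-token cost of one day of traffic, in USD."""
    return requests * tokens * price_per_million / TOKENS_PER_MILLION


def break_even_days(s: Scenario) -> Optional[float]:
    """Days of traffic needed for inference savings to repay the fixed costs.

    Returns None when fine-tuning never pays back, i.e. when it is not
    cheaper per day than few-shot prompting.
    """
    fixed_cost = s.training_job_cost + s.training_examples * s.labeling_cost_per_example
    few_shot = daily_inference_cost(s.requests_per_day, s.few_shot_input_tokens, s.frontier_price)
    fine_tuned = daily_inference_cost(s.requests_per_day, s.fine_tuned_input_tokens, s.fine_tuned_price)
    daily_saving = few_shot - fine_tuned
    if daily_saving <= 0:
        return None
    return fixed_cost / daily_saving


if __name__ == "__main__":
    # Illustrative numbers only; every value here is an assumption.
    high_volume = Scenario(
        requests_per_day=10_000,
        few_shot_input_tokens=1_500,
        fine_tuned_input_tokens=500,
        training_examples=1_000,
        labeling_cost_per_example=2.0,
        training_job_cost=100.0,
    )
    low_volume = Scenario(**{**high_volume.__dict__, "requests_per_day": 200})
    for name, scenario in (("high volume", high_volume), ("low volume", low_volume)):
        print(name, break_even_days(scenario))
```

Because the fixed cost is divided by a saving that scales linearly with traffic, the payback period is inversely proportional to request volume: cutting daily requests by a factor of 50 lengthens the payback period by the same factor. That is why the volume threshold matters as much as the number of training examples.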
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:58:40.024734+00:00— report_created — created