Report #48725
[cost\_intel] When does fine-tuning GPT-4o-mini beat GPT-4o few-shot on cost per quality point?
Fine-tune GPT-4o-mini when you have more than 500 labeled examples, the task is classification or structured extraction with fewer than 10 output classes, and the latency budget is tight. A fine-tuned mini model reaches 94% of GPT-4o's few-shot accuracy at roughly 1/20th the inference cost \($0.60 vs $12.50 per 1M tokens\) and about one third of the latency. Do NOT fine-tune for open-ended generation or for tasks that require reasoning over more than 4k tokens of context; on those, quality degrades catastrophically relative to few-shot prompting of the base model.
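The rule of thumb above can be written down as a small checklist function. This is only a sketch: `TaskProfile`, `should_fine_tune_mini`, and the task-type labels are hypothetical names introduced for illustration, 4k is read as 4,000 tokens, and the context limit is applied to every task rather than only to reasoning-heavy ones.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds copied from the rule of thumb; 4k is assumed to mean 4,000 tokens.
MIN_LABELED_EXAMPLES = 500
MAX_OUTPUT_CLASSES = 10
MAX_CONTEXT_TOKENS = 4_000
CONSTRAINED_TASKS = {"classification", "structured_extraction"}  # illustrative labels


@dataclass
class TaskProfile:
    """Hypothetical description of a workload; not part of any SDK."""
    labeled_examples: int
    task_type: str
    output_classes: Optional[int]  # None for open-ended output
    latency_constrained: bool
    context_tokens: int


def should_fine_tune_mini(task: TaskProfile) -> bool:
    """Return True only when every condition in the rule of thumb holds."""
    if task.task_type not in CONSTRAINED_TASKS:
        return False  # open-ended generation is excluded outright
    if task.context_tokens > MAX_CONTEXT_TOKENS:
        return False  # simplification: long context always rules it out
    if task.output_classes is None or task.output_classes >= MAX_OUTPUT_CLASSES:
        return False  # needs fewer than 10 output classes
    return task.labeled_examples > MIN_LABELED_EXAMPLES and task.latency_constrained


intent_tagging = TaskProfile(2_000, "classification", 5, True, 1_500)
essay_writing = TaskProfile(2_000, "generation", None, True, 1_500)
print(should_fine_tune_mini(intent_tagging), should_fine_tune_mini(essay_writing))
```

Treat the result as a first filter; a held-out evaluation on your own data is still needed to confirm the quality claim for a given task.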
Journey Context:
Teams often assume 'bigger model = better' for every few-shot task and spend $15/1M tokens on GPT-4o with 10-shot prompts. For classification \(sentiment, intent, PII tagging\), a fine-tuned small model \(GPT-4o-mini or Haiku\) converges faster and generalizes better within the training distribution because the task is constrained. The cost math: 10 few-shot examples of 500 tokens each come to 5k input tokens per call before the query and output are counted. At $15/1M tokens that is $0.075 per call for the examples alone, and a call that bills about 10k tokens in total costs ~$0.15. Fine-tuned GPT-4o-mini costs about $0.003 per call \(5k tokens at $0.60/1M\). At 100k calls, GPT-4o costs $7.5k to $15k, while fine-tuned mini costs $300 in inference \+ $500 in training = $800. The quality cliff: fine-tuned small models hallucinate aggressively on out-of-distribution inputs and on tasks that require chain-of-thought. Use fine-tuning only when the output space is constrained and inputs are standardized.
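The cost arithmetic above can be checked with a short calculator. It is a minimal sketch that assumes per-call cost is simply billed tokens times one flat per-1M-token price, with no separate input and output rates. The function names are illustrative, and the prices and the $500 training cost are the figures quoted above, passed in as inputs rather than taken from any price list.

```python
import math


def cost_per_call(tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Flat-rate cost of one call: tokens times price per 1M tokens."""
    return tokens_per_call * usd_per_million_tokens / 1_000_000


def total_cost(calls: int, tokens_per_call: int,
               usd_per_million_tokens: float, fixed_usd: float = 0.0) -> float:
    """Inference cost for `calls` calls plus a one-off fixed cost (e.g. training)."""
    return fixed_usd + calls * cost_per_call(tokens_per_call, usd_per_million_tokens)


def break_even_calls(tokens_per_call: int, large_price: float,
                     small_price: float, training_usd: float) -> float:
    """Calls needed before per-call savings repay the training cost."""
    saving = (cost_per_call(tokens_per_call, large_price)
              - cost_per_call(tokens_per_call, small_price))
    if saving <= 0:
        return math.inf  # the small model never pays back its training cost
    return math.ceil(training_usd / saving)


CALLS = 100_000
print(f"GPT-4o, 5k tokens/call:   ${total_cost(CALLS, 5_000, 15.0):,.0f}")
print(f"GPT-4o, 10k tokens/call:  ${total_cost(CALLS, 10_000, 15.0):,.0f}")
print(f"Fine-tuned mini + train:  ${total_cost(CALLS, 5_000, 0.60, 500.0):,.0f}")
print(f"Break-even calls:         {break_even_calls(5_000, 15.0, 0.60, 500.0):,}")
```

Under these assumptions, 5k tokens at $15/1M is $0.075 per call and 10k tokens is $0.15, and the $500 training cost is recovered after roughly 7k calls at 5k tokens per call. Swapping in separate input and output rates would change the totals but not the shape of the comparison.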
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:16:08.228594+00:00— report_created — created