Report #53128
[cost\_intel] When does fine-tuning actually reduce cost per quality point versus few-shot prompting?
Fine-tune GPT-4o-mini only when you have >10,000 examples, the task is domain-specific \(legal/medical terminology\), and latency matters. For <5,000 examples, 5-shot prompting with GPT-4o is cheaper and higher quality. Fine-tuning loses to prompting on novel distributions.
Journey Context:
The 'fine-tuning is cheaper' myth persists. OpenAI's fine-tuning pricing includes training costs \($0.008/1K tokens for 4o-mini\) plus inference \($0.6/1M input, $2.4/1M output\). But the real cost is generalization: fine-tuned models overfit to the training distribution. If your production data drifts 10%, the fine-tuned model degrades catastrophically while the base model with few-shot prompting adapts instantly. Break-even analysis: at 1M requests/day, fine-tuning saves ~$200/day in inference but costs $5,000\+ to train. You need 25\+ days of volume to break even, assuming zero distribution shift—which never happens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:40:20.227523+00:00— report_created — created