Report #44873
[cost_intel] At what training data scale does fine-tuning GPT-4o-mini beat few-shot GPT-4o on cost-per-quality?
Fine-tune when you have more than 500 high-quality examples, expect more than 10k inference calls per month, and the task requires consistent output formatting or style adherence. Break-even against paying for long-context few-shot examples on every call is typically about 3 months of usage.
Journey Context:
Engineers often default to stuffing 10-20 examples into the prompt of a large model (GPT-4o) to enforce style, burning input tokens on every call. Fine-tuning a smaller model (GPT-4o-mini or Haiku) bakes the behavior into the weights, allowing zero-shot inference with tiny prompts. The math: fine-tuning costs roughly $3-8 per million training tokens (e.g., 500 examples of 1k tokens each = 500k tokens, about $1.50-4), plus inference at about 1/10th the cost of frontier models. If your few-shot prompt spends 2k tokens on examples per call and you make 50k calls/month, that is 100M input tokens/month. At GPT-4o input prices ($2.50/1M), that is $250/month in example tokens alone, versus roughly $0 for amortized training plus about $15/month in mini inference. The main risk is overfitting on small datasets (<200 examples): validate on held-out data and prefer few-shot if data is scarce.
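The arithmetic above can be reproduced with a minimal Python sketch. The function names are hypothetical, and the prices are the figures quoted in this report rather than values checked against a current price list. It counts token spend only, so data curation and evaluation effort are not priced in.

```python
# Minimal sketch of the cost arithmetic in this report.
# Function names are illustrative; prices are the report's quoted figures.

TOKENS_PER_MILLION = 1_000_000


def few_shot_example_cost(example_tokens_per_call, calls_per_month, usd_per_million):
    """Monthly USD spent on in-prompt example tokens alone."""
    monthly_tokens = example_tokens_per_call * calls_per_month
    return monthly_tokens / TOKENS_PER_MILLION * usd_per_million


def training_cost(num_examples, tokens_per_example, usd_per_million):
    """One-time USD cost of processing the training set once."""
    training_tokens = num_examples * tokens_per_example
    return training_tokens / TOKENS_PER_MILLION * usd_per_million


# 2k example tokens per call, 50k calls/month, $2.50 per 1M input tokens.
few_shot_monthly = few_shot_example_cost(2_000, 50_000, 2.50)  # 250.0

# 500 examples of 1k tokens each, at the low and high ends of $3-8 per 1M.
train_low = training_cost(500, 1_000, 3.0)   # 1.5
train_high = training_cost(500, 1_000, 8.0)  # 4.0

# The report's quoted monthly inference figure for the fine-tuned mini model.
mini_monthly = 15.0
monthly_savings = few_shot_monthly - mini_monthly  # 235.0

print(f"few-shot examples: ${few_shot_monthly:.2f}/month")
print(f"training (one-time): ${train_low:.2f}-${train_high:.2f}")
print(f"token savings after fine-tuning: ${monthly_savings:.2f}/month")
```

Swapping in your own call volume, example length, and current prices shows how quickly the per-call example tokens dominate the one-time training spend.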
Lifecycle
2026-06-19T05:47:17.060746+00:00— report_created — created