Report #88322
[cost\_intel] At what volume does fine-tuning GPT-4o-mini become cheaper than few-shot prompting with GPT-4o?
Fine-tune smaller models \(GPT-4o-mini, Llama 3.1 8B\) when you have >50,000 examples of a narrow task \(classification, structured extraction, tone matching\) and require <100ms latency. Break-even: at the quoted rates, fine-tuned GPT-4o-mini \($0.60/M input \+ $2.40/M output\) is about 8x cheaper per input token and about 6x cheaper per output token than GPT-4o few-shot \($5.00/M input \+ $15.00/M output\). Because the fine-tuned model no longer needs few-shot examples in every prompt, the saving at 1M requests/month comes to roughly 10x on inference cost. Quality delta: on narrow domains, a fine-tuned small model reaches about 90% of frontier few-shot accuracy.
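The per-request arithmetic behind that comparison can be sketched as follows. This is a minimal sketch, not a pricing tool: the per-token rates are the ones quoted in this report and may be out of date, and the token counts are hypothetical assumptions chosen for illustration: 2,000 input tokens for a few-shot prompt, 300 for a fine-tuned prompt, 100 output tokens for both.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Rates as quoted in this report; verify against current pricing before use.
GPT_4O = Pricing(input_per_million=5.00, output_per_million=15.00)
FINE_TUNED_MINI = Pricing(input_per_million=0.60, output_per_million=2.40)


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Inference cost in USD for a single request."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / TOKENS_PER_MILLION


def monthly_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 requests: int) -> float:
    """Inference cost in USD for a month of identical requests."""
    return cost_per_request(pricing, input_tokens, output_tokens) * requests


# Hypothetical workload: token counts below are assumptions, not measurements.
REQUESTS_PER_MONTH = 1_000_000
few_shot_monthly = monthly_cost(GPT_4O, 2_000, 100, REQUESTS_PER_MONTH)
tuned_monthly = monthly_cost(FINE_TUNED_MINI, 300, 100, REQUESTS_PER_MONTH)

print(f"GPT-4o few-shot:        ${few_shot_monthly:,.2f}/month")
print(f"Fine-tuned GPT-4o-mini: ${tuned_monthly:,.2f}/month")
print(f"Ratio:                  {few_shot_monthly / tuned_monthly:.1f}x")
```

Under these assumed token counts the gap is wider than 10x, because most of the few-shot cost comes from the long prompt. With identical prompt lengths on both sides, the ratio falls back toward the 6x to 8x per-token difference.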
Journey Context:
A common mistake is fine-tuning too early \(<10K examples\) or on tasks that are too broad \(general reasoning\). Fine-tuning shines when the task is 'pattern matching on a proprietary data format' \(e.g., extracting specific medical codes from notes\), where prompting requires 10-shot examples \(expensive tokens\) and still fails 5% of the time. The hidden cost is training \($30-300/run\), but it amortizes over millions of inference calls.
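To see how quickly the training cost amortizes, a rough break-even volume can be computed as training cost divided by the per-request saving. The sketch below assumes a few-shot request costs about $0.0115 and a fine-tuned request about $0.00042. Both figures are hypothetical, derived from the rates quoted in this report with an assumed 2,000 versus 300 input tokens and 100 output tokens. The function name `break_even_requests` is illustrative, and data labeling and evaluation costs are ignored.

```python
import math


def break_even_requests(training_cost: float,
                        baseline_cost_per_request: float,
                        tuned_cost_per_request: float) -> int:
    """Smallest number of requests at which the training cost is recovered.

    All costs are in USD. Raises ValueError when the fine-tuned model is
    not cheaper per request, since the training cost is then never recovered.
    """
    saving = baseline_cost_per_request - tuned_cost_per_request
    if saving <= 0:
        raise ValueError("fine-tuned model must be cheaper per request")
    return math.ceil(training_cost / saving)


# Hypothetical per-request costs (assumptions, see the paragraph above).
FEW_SHOT_COST = 0.0115
TUNED_COST = 0.00042

# Training cost range quoted in this report: $30 to $300 per run.
for training_cost in (30.0, 300.0):
    n = break_even_requests(training_cost, FEW_SHOT_COST, TUNED_COST)
    print(f"${training_cost:.0f} training run pays back after {n:,} requests")
```

Under these assumptions a single training run pays for itself within tens of thousands of requests, so at millions of calls per month the training line item is small next to inference. Repeated retraining runs multiply the numerator and should be added in if the model is refreshed often.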
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:49:52.229930+00:00— report_created — created