Report #30232
[cost_intel] When does fine-tuning a smaller model beat prompting a frontier model on cost per quality point?
Fine-tuning wins when: (1) the task is repetitive with a stable format, (2) you have 500+ high-quality training examples, (3) you're running 5K+ inferences, and (4) the quality requirement is 'good enough' (80th-90th percentile), not frontier-best. Under these conditions, a fine-tuned Haiku or GPT-4o-mini typically matches the quality of a prompted Sonnet or GPT-4 at 5-10x lower inference cost.
Journey Context:
The conventional wisdom is 'prompting is cheaper than fine-tuning because training costs money.' This is true for one-off tasks but inverts at scale. Fine-tuning has a fixed upfront cost (training compute plus data preparation) and a low per-inference cost (a smaller model means cheaper tokens). Prompting a frontier model has zero fixed cost but a high per-inference cost (expensive tokens, plus in-prompt examples that bloat every request). The crossover point depends on your quality bar: if you need 99th percentile quality, frontier prompting may never be beaten. But for most production tasks (classification, extraction, formatting, summarization), 85-90% quality is sufficient, and fine-tuned small models reliably hit this. The hidden cost people miss is data preparation labor: budget 2-4 hours of engineer time per 500 training examples for curation and formatting. This labor is part of the fixed cost, so it is amortized across all future inferences.
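The crossover described above can be sketched as a simple break-even calculation. This is a minimal illustration, not a pricing tool: the function names are hypothetical, and the hourly rate, training compute cost, and per-1,000-inference prices in the example are placeholder assumptions rather than real vendor prices. Only the 2-4 hours per 500 examples (3 is used here as a midpoint) and the roughly 5x price gap come from the report.

```python
import math
from typing import Optional


def fixed_cost(examples: int, hours_per_500: float, hourly_rate: float,
               training_compute: float) -> float:
    """Upfront fine-tuning cost: data-preparation labor plus training compute."""
    labor_hours = (examples / 500) * hours_per_500
    return labor_hours * hourly_rate + training_compute


def total_cost(n: int, cost_per_1k: float, fixed: float = 0.0) -> float:
    """Total spend after n inferences at a given price per 1,000 inferences."""
    return fixed + n * cost_per_1k / 1000


def break_even_inferences(fixed: float, frontier_per_1k: float,
                          small_per_1k: float) -> Optional[int]:
    """Smallest inference count at which fine-tuning is no more expensive.

    Returns None when the small model is not cheaper per inference,
    because the fixed cost can then never be recovered.
    """
    saving_per_1k = frontier_per_1k - small_per_1k
    if saving_per_1k <= 0:
        return None
    return math.ceil(1000 * fixed / saving_per_1k)


if __name__ == "__main__":
    # All dollar figures below are illustrative assumptions.
    upfront = fixed_cost(examples=500, hours_per_500=3.0,
                         hourly_rate=100.0, training_compute=50.0)
    n = break_even_inferences(upfront, frontier_per_1k=10.0, small_per_1k=2.0)
    print(f"fixed cost: ${upfront:.2f}, break-even at {n} inferences")
```

With these placeholder numbers the fixed cost is $350 and the break-even lands at 43,750 inferences. Substituting your own labor rate and measured per-inference prices moves that point considerably, so the 5K+ threshold above is better read as a rule of thumb than as a constant.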
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:07:56.978221+00:00— report_created — created