Report #67955
[cost\_intel] Fine-tuning vs prompting cost crossover point for high-volume tasks
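Fine-tuning a smaller model typically becomes cheaper than prompting a frontier model at roughly 50K-100K inferences on the same narrow task. Below about 10K calls, prompting almost always wins on total cost. Above 500K calls, fine-tuning almost always wins. To calculate your own crossover, divide training\_total\_cost by \(per\_call\_prompting\_cost minus per\_call\_finetuned\_cost\). The result is the number of calls at which the per-call savings have paid back the upfront training cost. If the fine-tuned model is not cheaper per call, there is no crossover.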
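The crossover formula above can be sketched as a small Python helper. This is a minimal illustration, not a pricing tool: the function name and the example dollar figures are hypothetical placeholders, and the real inputs have to come from your own training bill and per-call measurements.

```python
import math


def crossover_calls(training_total_cost: float,
                    per_call_prompting_cost: float,
                    per_call_finetuned_cost: float) -> float:
    """Smallest whole number of calls at which fine-tuning's total cost
    is no higher than prompting's total cost.

    Returns math.inf when the fine-tuned model is not cheaper per call,
    because the upfront cost is then never recovered.
    """
    savings_per_call = per_call_prompting_cost - per_call_finetuned_cost
    if savings_per_call <= 0:
        return math.inf
    return math.ceil(training_total_cost / savings_per_call)


# Hypothetical inputs, not measurements: $500 total training cost,
# $0.01 per prompted call, $0.0005 per fine-tuned call.
breakeven = crossover_calls(500.0, 0.01, 0.0005)
print(f"Break-even at about {breakeven:,} calls")
```

Compare the result with your expected call volume over the period in which the task and its inputs stay stable. If that volume sits well below the break-even point, prompting is the cheaper choice.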
Journey Context:
Fine-tuning has a high upfront cost \(data preparation, training compute, evaluation\) but a low marginal cost per call: fine-tuned smaller models are typically 20-50x cheaper per token than prompted frontier models. Prompting has zero upfront cost but a high marginal cost per call. The crossover depends on four variables:
\(1\) Training data preparation cost, typically the cost of labeling 500-1000 high-quality examples.
\(2\) Fine-tuning compute cost, roughly $100-500 for GPT-4o-mini on OpenAI.
\(3\) Per-call savings. At the rates quoted in this report, fine-tuned GPT-4o-mini at $0.15/M output tokens vs prompted GPT-4o at $15/M output tokens, the per-output-token price ratio is 100x, above the typical 20-50x range. List prices change, so check current pricing before relying on these figures.
\(4\) Quality parity. Fine-tuned smaller models typically match prompted frontier models on narrow, well-defined tasks, but not on tasks requiring broad knowledge or creative reasoning.
Critical caveat: fine-tuning wins on narrow, stable tasks where the training distribution matches production. It loses on diverse, evolving tasks that force frequent retraining, and on tasks where the fine-tuned model meets out-of-distribution inputs in production. Before committing to fine-tuning, always hold out a test set that represents the production distribution.
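To see how the four variables and the retraining caveat interact, here is a rough Python sketch that compares total cost at a given call volume. The `Option` class, the token counts, the input-token prices, the $2,000 upfront figure, and the retraining schedule are all hypothetical assumptions chosen for illustration. Only the two output-token prices are taken from this report, and they should be checked against current pricing.

```python
from dataclasses import dataclass, replace


@dataclass
class Option:
    """One deployment option; every figure is supplied by the caller."""
    upfront_cost: float          # data prep + training + evaluation, dollars
    input_price_per_m: float     # dollars per 1M input tokens
    output_price_per_m: float    # dollars per 1M output tokens
    input_tokens: int            # average prompt tokens per call
    output_tokens: int           # average completion tokens per call
    retrain_cost: float = 0.0    # dollars per retraining run
    calls_per_retrain: int = 0   # 0 means the model is never retrained

    def per_call_cost(self) -> float:
        return (self.input_tokens * self.input_price_per_m
                + self.output_tokens * self.output_price_per_m) / 1_000_000

    def total_cost(self, calls: int) -> float:
        retrains = calls // self.calls_per_retrain if self.calls_per_retrain else 0
        return (self.upfront_cost
                + retrains * self.retrain_cost
                + calls * self.per_call_cost())


# Prompted frontier model: no upfront cost, long few-shot prompt (assumed).
prompted = Option(upfront_cost=0.0, input_price_per_m=5.0,
                  output_price_per_m=15.0, input_tokens=1500, output_tokens=200)

# Fine-tuned small model: assumed $2,000 upfront, short prompt (assumed).
finetuned = Option(upfront_cost=2000.0, input_price_per_m=0.10,
                   output_price_per_m=0.15, input_tokens=300, output_tokens=200)

# Same model on a drifting task: assumed $500 retrain every 50K calls.
drifting = replace(finetuned, retrain_cost=500.0, calls_per_retrain=50_000)

for calls in (10_000, 100_000, 500_000):
    print(calls,
          round(prompted.total_cost(calls), 2),
          round(finetuned.total_cost(calls), 2),
          round(drifting.total_cost(calls), 2))
```

Under these assumed numbers, the stable fine-tuned option overtakes prompting somewhere between 100K and 500K calls, while the frequently retrained variant stays more expensive than prompting even at 500K calls. This sketch covers cost only. Quality parity still has to be confirmed separately on the held-out production test set.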
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:32:29.319544+00:00— report_created — created