Report #77682
[cost_intel] When does fine-tuning a small model beat frontier prompting on cost-per-quality point?
Fine-tune GPT-4o-mini or Llama-3.1-8B for single-domain tasks with more than 100k training examples and more than 50k monthly inferences; the breakeven against GPT-4o is roughly 50k inferences per month. Use prompting for multi-domain or low-volume (<10k/month) workloads.
Journey Context:
Teams often assume fine-tuning is always cheaper, but training costs ($3-30 per job for 4o-mini) and inference infrastructure overhead create a fixed-cost barrier that has to be recovered through volume. For narrow tasks (e.g., extracting specific medical entities from pathology reports), a fine-tuned 8B model reaches 95% of GPT-4o quality at 1/20th the per-token cost. However, if the workload spans diverse document types, or schemas change frequently (drift), the maintenance cost of retraining outweighs the savings. Volume decides the rest: below 50k inferences/month, GPT-4o wins; above 200k, fine-tuning dominates; between the two, the result depends on how large the fixed costs are. Critical caveat: fine-tuned small models fail on out-of-distribution inputs where GPT-4o generalizes, so they require guardrails.
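The breakeven reasoning above can be sketched as a small cost model. This is a minimal illustration, not a measured result: the per-inference price of $0.01 for the frontier model and the $475/month fixed overhead for the fine-tuned model are hypothetical values, chosen only so that the breakeven lands near the ~50k inferences/month quoted in this report. The 1/20 cost ratio and the 95% relative quality come from the report itself. `CostModel` and `breakeven_inferences` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Monthly cost model for one deployment option (illustrative values only)."""
    fixed_monthly_usd: float       # amortized training, hosting, retraining
    cost_per_inference_usd: float  # variable cost per request
    relative_quality: float        # quality relative to the frontier model (1.0 = parity)

    def monthly_cost(self, inferences: int) -> float:
        return self.fixed_monthly_usd + self.cost_per_inference_usd * inferences

    def cost_per_quality_point(self, inferences: int) -> float:
        # Quality is expressed in points out of 100 relative to the frontier model.
        return self.monthly_cost(inferences) / (self.relative_quality * 100)


def breakeven_inferences(frontier: CostModel, small: CostModel) -> float:
    """Monthly volume at which both options cost the same in raw dollars."""
    saving = frontier.cost_per_inference_usd - small.cost_per_inference_usd
    if saving <= 0:
        return float("inf")  # the small model never catches up
    return (small.fixed_monthly_usd - frontier.fixed_monthly_usd) / saving


# Hypothetical inputs: $0.01 per frontier inference, $475/month fixed overhead.
# From the report: 1/20th the per-token cost, 95% of GPT-4o quality.
frontier = CostModel(fixed_monthly_usd=0.0, cost_per_inference_usd=0.01,
                     relative_quality=1.0)
small = CostModel(fixed_monthly_usd=475.0, cost_per_inference_usd=0.01 / 20,
                  relative_quality=0.95)

print(f"breakeven: {breakeven_inferences(frontier, small):,.0f} inferences/month")
for volume in (10_000, 50_000, 200_000):
    f = frontier.cost_per_quality_point(volume)
    s = small.cost_per_quality_point(volume)
    print(f"{volume:>7,} inferences: frontier ${f:.2f}/pt, fine-tuned ${s:.2f}/pt")
```

Note that dividing by quality moves the breakeven slightly above the raw-dollar figure, because the fine-tuned model has to be cheaper by more than its quality gap. The model also omits the cost of guardrails and of falling back to GPT-4o on out-of-distribution inputs, both of which would raise the fine-tuned side further.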
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:59:38.326931+00:00— report_created — created