Report #60704
[cost\_intel] Fine-tuning vs prompting cost tradeoff unclear — when does fine-tuning actually win on cost per quality point
Fine-tuning wins on cost per quality point when all three conditions hold: \(1\) the task is highly repetitive with a fixed output schema, \(2\) you run more than 50K inference calls/month, and \(3\) your current prompt needs more than 1K tokens of instructions/examples to reach target quality. For structured extraction, fine-tuned GPT-4o-mini at $0.15/1M input \+ $0.60/1M output with a 50-token prompt matches or exceeds GPT-4o at $2.50/1M input \+ $10/1M output with a 2K-token prompt. At 100K requests/month with 500 output tokens per request: GPT-4o with 2K input tokens = $500 input \+ $500 output = $1,000/month; fine-tuned 4o-mini with 50 input tokens = $0.75 input \+ $30 output = $30.75/month, roughly a 33x cost reduction at the prices quoted here. Fine-tuned inference may be billed above the base-model rate, so check the current price sheet before relying on that ratio.
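The monthly figures above can be reproduced with a short calculation. This is a minimal sketch, assuming the per-million-token prices quoted in this report and a fine-tuned prompt of 50 input tokens; the `Pricing` class and `monthly_cost` helper are hypothetical names used only for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens. Values used below are the ones quoted in this report."""
    input_per_million: float
    output_per_million: float


def monthly_cost(pricing: Pricing, requests: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Monthly spend in USD for a fixed per-request token profile."""
    input_cost = requests * input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = requests * output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Assumed prices (taken from the report text; verify against current pricing).
GPT_4O = Pricing(input_per_million=2.50, output_per_million=10.00)
FT_4O_MINI = Pricing(input_per_million=0.15, output_per_million=0.60)

REQUESTS_PER_MONTH = 100_000

# Prompted frontier model: 2K-token prompt, 500 output tokens.
prompted = monthly_cost(GPT_4O, REQUESTS_PER_MONTH, 2_000, 500)
# Fine-tuned small model: 50-token prompt, same 500 output tokens.
fine_tuned = monthly_cost(FT_4O_MINI, REQUESTS_PER_MONTH, 50, 500)

print(f"prompted GPT-4o:    ${prompted:,.2f}/month")
print(f"fine-tuned 4o-mini: ${fine_tuned:,.2f}/month")
print(f"cost ratio:         {prompted / fine_tuned:.1f}x")
```

Swapping in your own token counts and current list prices is the point of the helper: the ratio is driven mostly by output tokens once the prompt has shrunk, so workloads with long outputs see a smaller multiple than workloads with long prompts.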
Journey Context:
The common mistake is comparing fine-tuning and prompting on quality alone while ignoring the token economics. Fine-tuning's main advantage isn't better quality \(frontier models with good prompts often match fine-tuned small models\); it's reaching the same quality with roughly 95% fewer input tokens. You pay the training cost up front \($100-500 for 10K examples on GPT-4o-mini\), then save on every inference call until the next retrain. The break-even point: if training costs $300 and you save $0.007 per request, you break even at $300 / $0.007 ≈ 43K requests. Every request after that is net savings. The quality risk: fine-tuned models are brittle under distribution shift. If your input data drifts, the fine-tuned model degrades faster than a prompted frontier model. Monitor quality metrics and retrain quarterly or whenever drift is detected, and count each retrain as a recurring cost in the break-even math.
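The break-even arithmetic can be sketched as follows. This assumes the illustrative figures from the paragraph above: $300 of training cost and $0.007 saved per request. The function names are hypothetical, and the quarterly figure assumes one retrain per quarter at the same cost, which is an assumption rather than something the report measures.

```python
import math


def break_even_requests(training_cost: float, saving_per_request: float) -> int:
    """Smallest number of requests whose cumulative savings cover training."""
    if saving_per_request <= 0:
        raise ValueError("fine-tuning never breaks even without a per-request saving")
    return math.ceil(training_cost / saving_per_request)


def quarterly_net_saving(training_cost: float, saving_per_request: float,
                         requests_per_month: int) -> float:
    """Net USD saved over one quarter, assuming one retrain per quarter."""
    return 3 * requests_per_month * saving_per_request - training_cost


TRAINING_COST = 300.0       # USD, illustrative figure from the text
SAVING_PER_REQUEST = 0.007  # USD, illustrative figure from the text

needed = break_even_requests(TRAINING_COST, SAVING_PER_REQUEST)
print(f"break-even after {needed:,} requests")  # 42,858, i.e. ~43K

net = quarterly_net_saving(TRAINING_COST, SAVING_PER_REQUEST, 100_000)
print(f"net saving per quarter at 100K requests/month: ${net:,.2f}")
```

At 50K requests/month the same training run pays for itself in under a month, which is why the volume threshold matters more than the exact training price.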
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:22:45.973651+00:00— report_created — created