Report #51170
[cost_intel] When fine-tuning a smaller model beats prompting a frontier model on cost per quality point
Fine-tune when: (1) you have >5K task examples, (2) the task is narrow and repetitive (specific output format, domain extraction), and (3) you run >100K inferences/month. On such tasks, fine-tuned GPT-4o-mini can match prompted GPT-4o at roughly 1/10th the per-inference cost. Crossover: ~50-100K requests to amortize the training investment.
Journey Context:
The math: prompting Sonnet at $3/M input + $15/M output for a task with 2K input + 500 output tokens costs ~$0.0135/request ($0.006 input + $0.0075 output). Fine-tuned GPT-4o-mini at $0.15/M input + $0.60/M output for the same task costs ~$0.0006/request, a ~23x cost reduction. But fine-tuning has upfront costs: data preparation ($5-20K in engineer time), training runs ($50-500 depending on model and data size), and an evaluation pipeline. Fine-tuning fails when: (1) the task is too broad, since one model can't learn 50 different output patterns; (2) training data doesn't cover edge cases, since fine-tuned models are less robust to distribution shift; (3) the task requires reasoning the base model fundamentally can't do. Key insight: fine-tuning is format compression, not capability expansion. It teaches the model your specific format and domain vocabulary; it doesn't make a small model smart enough to reason.
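The arithmetic above can be reproduced with a short Python sketch. It uses only the prices and token counts quoted in this report; the names `Pricing`, `cost_per_request`, and `break_even_requests` are hypothetical helpers introduced for illustration. The break-even model is a simplifying assumption: it treats the upfront cost as a one-time spend and ignores retraining, hosting, and quality differences.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD, as quoted in this report."""
    input_per_m: float
    output_per_m: float


def cost_per_request(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the given token counts."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


def break_even_requests(upfront_usd: float, baseline_cost: float, tuned_cost: float) -> int:
    """Requests needed before per-request savings repay the upfront spend."""
    saving = baseline_cost - tuned_cost
    if saving <= 0:
        raise ValueError("fine-tuned model is not cheaper per request")
    return math.ceil(upfront_usd / saving)


# Figures from the report: 2K input tokens, 500 output tokens per request.
frontier_cost = cost_per_request(Pricing(3.00, 15.00), 2000, 500)
tuned_cost = cost_per_request(Pricing(0.15, 0.60), 2000, 500)

print(f"frontier: ${frontier_cost:.4f}/request")
print(f"fine-tuned: ${tuned_cost:.4f}/request")
print(f"ratio: {frontier_cost / tuned_cost:.1f}x")

# Upfront cost scenarios (assumed one-time): training run only vs. engineer time.
for upfront in (500, 5_000, 20_000):
    n = break_even_requests(upfront, frontier_cost, tuned_cost)
    print(f"upfront ${upfront:,}: break-even after ~{n:,} requests")
```

Under these assumptions the result is very sensitive to what counts as upfront cost. A $500 training run is repaid after roughly 39K requests, while $5K of engineer time pushes the break-even to roughly 390K requests. It is worth rerunning the calculation with your own prices, token counts, and labor estimate before committing.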
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:22:42.927875+00:00— report_created — created