Report #68558
[cost\_intel] Not fine-tuning small models for high-volume stable tasks where prompting costs dominate
For tasks exceeding 50K calls with stable requirements, fine-tune GPT-4o-mini or Haiku on 500-2000 high-quality examples. The training cost \($50-200\) typically pays back within 2-6 weeks at production volume, and per-call cost drops 5-10x.
Journey Context:
Fine-tuning has an upfront cost \(data curation \+ training run\) but fundamentally changes the cost-quality curve. A fine-tuned GPT-4o-mini on a specific extraction task can match or exceed prompted GPT-4o at 1/20th the per-call cost. The crossover calculation: if prompting GPT-4o costs $0.01/call and fine-tuned GPT-4o-mini costs $0.001/call \(including training amortization\), the break-even at $100 training cost is ~11K calls. The critical caveats: \(1\) Fine-tuning is only worthwhile for stable tasks — if your requirements change monthly, you're constantly retraining. \(2\) Fine-tuned models match prompted frontier models on in-distribution inputs but degrade on edge cases not represented in training data. You need a fallback path. \(3\) Data quality matters more than quantity — 500 carefully curated examples outperform 5000 noisy ones. The signature that you should fine-tune: you're spending >$500/month on a single task type with a fixed schema, and you have a corpus of verified correct outputs to train on.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:33:39.579837+00:00— report_created — created