Report #44160
[cost_intel] Over-prompting frontier models for tasks where a fine-tuned smaller model achieves the same quality at 1/20th the inference cost
Calculate the fine-tuning break-even. If you have more than 1K high-quality input-output examples, your prompt carries more than 500 tokens of instructions and examples, and you project more than 10K inference calls, fine-tuning a smaller model (GPT-4o-mini, Haiku) will beat prompting a frontier model on total cost within 1-3 months. For stable, repetitive task types, the typical outcome is a fine-tuned Haiku that matches a prompted Sonnet's quality at roughly 1/20th of the per-call inference cost.
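A minimal sketch of the break-even arithmetic, assuming the upfront cost can be treated as a single lump sum. The function names and the $500 upfront figure are hypothetical, chosen only for illustration. The per-call costs are the approximate figures quoted in this report.

```python
import math


def break_even_calls(upfront_cost, prompted_cost_per_call, finetuned_cost_per_call):
    """Return the number of calls after which fine-tuning is cheaper in total.

    Returns None when the fine-tuned call is not cheaper, because the
    upfront cost is then never recovered.
    """
    saving_per_call = prompted_cost_per_call - finetuned_cost_per_call
    if saving_per_call <= 0:
        return None
    return math.ceil(upfront_cost / saving_per_call)


def break_even_months(upfront_cost, prompted_cost_per_call,
                      finetuned_cost_per_call, calls_per_month):
    """Return the months of traffic needed to recover the upfront cost."""
    calls = break_even_calls(upfront_cost, prompted_cost_per_call,
                             finetuned_cost_per_call)
    if calls is None:
        return None
    return calls / calls_per_month


if __name__ == "__main__":
    # Hypothetical upfront cost (data preparation, training, evals): $500.
    # Per-call costs are the approximate figures from this report.
    months = break_even_months(
        upfront_cost=500.0,
        prompted_cost_per_call=0.003,
        finetuned_cost_per_call=0.00015,
        calls_per_month=100_000,
    )
    print(f"Break-even after about {months:.2f} months")
```

With these inputs the break-even lands inside the 1-3 month window. A larger upfront cost or a lower call volume pushes it out proportionally, so it is worth rerunning with your own numbers.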
Journey Context:
Fine-tuning has a high upfront cost (training compute, data preparation, eval infrastructure) but transforms the cost-quality curve. The mechanism: fine-tuning bakes the 500+ tokens of instructions, along with the pattern carried by your few-shot examples, into the model weights, so at inference time you send only the actual input. For a task with 100-token inputs and 50-token outputs, a prompted Sonnet call costs ~$0.003 while a fine-tuned Haiku call costs ~$0.00015, a 20x difference. At 100K calls/month, that is $300 vs $15. The training cost for 1K-5K examples on Haiku is negligible by comparison. The mistake is treating fine-tuning as a quality play when it is primarily a cost play for high-volume tasks. Fine-tuning does not help for tasks that vary significantly from call to call; it wins on stable, repetitive patterns like classification, extraction, and format standardization.
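The per-call figures above can be approximated with a short sketch. The per-million-token prices below are placeholder assumptions chosen to land near the report's rounded numbers, not quoted vendor rates, so check current price lists before relying on them. The `Pricing` class and `cost_per_call` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in USD per million tokens (placeholder values, not vendor quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_call(pricing, instruction_tokens, input_tokens, output_tokens):
    """Cost of one call: instruction and input tokens are billed as input."""
    billed_input = instruction_tokens + input_tokens
    total = (billed_input * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Assumed prices for illustration only.
FRONTIER = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)
FINETUNED_SMALL = Pricing(input_per_mtok=0.50, output_per_mtok=2.00)

# Prompted frontier model: 500 instruction tokens are sent on every call.
PROMPTED_COST = cost_per_call(FRONTIER, 500, 100, 50)

# Fine-tuned small model: the instructions live in the weights, so only
# the 100-token input is sent.
FINETUNED_COST = cost_per_call(FINETUNED_SMALL, 0, 100, 50)

if __name__ == "__main__":
    calls_per_month = 100_000
    print(f"Prompted:   ${PROMPTED_COST:.5f}/call, "
          f"${PROMPTED_COST * calls_per_month:.2f}/month")
    print(f"Fine-tuned: ${FINETUNED_COST:.5f}/call, "
          f"${FINETUNED_COST * calls_per_month:.2f}/month")
```

Under these assumed prices the gap comes from two places at once: the cheaper per-token rate and the 500 instruction tokens that no longer need to be sent on each call.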
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:35:36.408275+00:00— report_created — created