Report #27241
[cost_intel] When fine-tuning a small model beats prompting a frontier model on cost per quality point
Fine-tune a small model such as GPT-4o-mini or Haiku when you have a narrow, repetitive task, more than 10K inference calls, and high-quality training data. On narrow tasks, a fine-tuned small model can match frontier-model quality at 10 to 20 times lower per-token cost. Breakeven typically falls between 10K and 50K inference calls, depending on the upfront training cost and how much each call saves. Do not fine-tune if the task is broad, the prompt is still changing, or volume is low.
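The breakeven described above is simple arithmetic: the upfront cost divided by the saving per call. Here is a minimal sketch. The function name `breakeven_calls` and all dollar figures in the usage example are hypothetical illustrations, not measured prices.

```python
import math


def breakeven_calls(upfront_cost_usd: float,
                    frontier_cost_per_call_usd: float,
                    tuned_cost_per_call_usd: float) -> int:
    """Smallest number of calls at which the fine-tuned route costs no more
    in total than prompting the frontier model.

    upfront_cost_usd covers data preparation, training runs, and evaluation.
    """
    saving_per_call = frontier_cost_per_call_usd - tuned_cost_per_call_usd
    if saving_per_call <= 0:
        raise ValueError("fine-tuned model must be cheaper per call")
    return math.ceil(upfront_cost_usd / saving_per_call)


# Hypothetical inputs: $300 upfront, $0.015 per frontier call,
# $0.001 per fine-tuned call (a 15x per-call gap).
calls_needed = breakeven_calls(300.0, 0.015, 0.001)
print(calls_needed)
```

With these assumed inputs the result lands in the 10K to 50K range. A larger upfront cost or a smaller per-call saving pushes it higher.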
Journey Context:
The two failure modes are fine-tuning too early, before you have a stable prompt and sufficient volume, and never fine-tuning because prompting a frontier model works well enough. Fine-tuning carries upfront costs (training data preparation, training runs, and evaluation) but much lower per-inference costs: a fine-tuned GPT-4o-mini costs roughly 15 times less per token than GPT-4o. You need enough volume for the per-call savings to amortize those upfront costs. The decision framework is three questions: Is the task narrow and repetitive? Will you make more than 10K inference calls? Can you produce high-quality training examples from your existing prompt logs? If the answer to all three is yes, fine-tune. If not, iterate on prompting first. Fine-tuning on a moving target wastes training spend.
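The three questions above, plus the stable-prompt condition, can be sketched as a small checklist function. The names `TaskProfile`, `recommend`, and `MIN_CALLS` are hypothetical. The 10K threshold is the report's own figure, and you may want to tune it to your costs.

```python
from dataclasses import dataclass

MIN_CALLS = 10_000  # threshold taken from the report; adjust to your costs


@dataclass
class TaskProfile:
    narrow_and_repetitive: bool
    expected_inference_calls: int
    has_quality_training_examples: bool  # e.g. drawn from existing prompt logs
    prompt_is_stable: bool


def recommend(task: TaskProfile) -> str:
    """Return 'fine-tune' only when every condition holds."""
    # A prompt that is still changing means a moving training target.
    if not task.prompt_is_stable:
        return "iterate on prompting"
    if (task.narrow_and_repetitive
            and task.expected_inference_calls > MIN_CALLS
            and task.has_quality_training_examples):
        return "fine-tune"
    return "iterate on prompting"


# Hypothetical task: narrow, 50K expected calls, good logs, stable prompt.
print(recommend(TaskProfile(True, 50_000, True, True)))
```

The stability check comes first so that a task passing the other three conditions is still held back while its prompt is changing.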
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:07:19.012478+00:00— report_created — created