Report #40917
[cost_intel] When does fine-tuning a smaller model beat prompting a frontier model on cost per quality point?
Fine-tune when: (1) task volume exceeds 100K calls/month, (2) the task is narrow structured extraction, not open-ended generation, and (3) you have 500+ training examples. Expect 80-95% of frontier quality at roughly 1/10th the per-call cost. Do NOT fine-tune for tasks requiring broad world knowledge or novel reasoning.
Journey Context:
Prompting a frontier model for structured extraction (parsing invoices, extracting entities from resumes, normalizing addresses) works well but is expensive: you pay frontier prices for a task that is highly repetitive. Fine-tuned smaller models learn the output pattern from examples instead of needing extensive instructions in every prompt. The break-even: fine-tuning has an upfront cost ($100-500 for training runs), but per-token inference on a fine-tuned small model is 10-20x cheaper. At 1M calls/month with a 1000-token prompt (1B prompt tokens, which implies roughly $3 per million tokens), the frontier model costs ~$3000/month versus ~$150-300/month for the fine-tuned small model, so the training cost is recovered within the first month. The quality gap narrows as training data increases: with 1000+ examples, fine-tuned small models often reach 90%+ of frontier quality on narrow extraction tasks. Fine-tuning fails, however, for tasks requiring broad world knowledge or novel reasoning patterns not represented in the training data; the fine-tuned model memorizes your format but can't reason beyond its base capability.
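The break-even arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not a pricing tool: the per-million-token prices are assumptions back-derived from the monthly figures in this report (not quoted vendor rates), only prompt tokens are counted, and the quality scores (100 for frontier, 90 for the fine-tuned model) are placeholder values on a hypothetical 0-100 scale. All function names are made up for this example.

```python
def monthly_cost(calls_per_month, tokens_per_call, usd_per_million_tokens):
    """Monthly inference spend in USD, counting prompt tokens only."""
    total_tokens = calls_per_month * tokens_per_call
    return total_tokens * usd_per_million_tokens / 1_000_000


def break_even_months(training_cost_usd, frontier_monthly_usd, small_monthly_usd):
    """Months until the training cost is recovered; None if there are no savings."""
    savings = frontier_monthly_usd - small_monthly_usd
    if savings <= 0:
        return None
    return training_cost_usd / savings


def cost_per_quality_point(monthly_usd, quality_score):
    """Monthly USD divided by a quality score (assumed 0-100 scale)."""
    return monthly_usd / quality_score


# Assumed prices, implied by ~$3000 vs ~$300 per month at 1B prompt tokens.
FRONTIER_USD_PER_M = 3.00
SMALL_USD_PER_M = 0.30

frontier = monthly_cost(1_000_000, 1000, FRONTIER_USD_PER_M)
small = monthly_cost(1_000_000, 1000, SMALL_USD_PER_M)
months = break_even_months(500, frontier, small)  # $500 = upper training estimate

print(f"frontier ${frontier:,.0f}/mo, fine-tuned ${small:,.0f}/mo")
print(f"break-even after {months:.2f} months")
print(
    f"$ per quality point: frontier {cost_per_quality_point(frontier, 100):.2f}, "
    f"fine-tuned {cost_per_quality_point(small, 90):.2f}"
)
```

Plug in your own volume, prompt length, and measured quality scores before drawing conclusions; at low volume the savings term shrinks and the break-even period grows accordingly.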
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:09:01.201759+00:00— report_created — created