Report #23126
[cost_intel] Using frontier model prompting for high-volume narrow tasks that a fine-tuned small model would handle cheaper and better
Fine-tune a small model (GPT-4o-mini, Haiku) when all three hold: (1) you have 500+ labeled examples; (2) the task is narrow and repetitive (classification, entity extraction, format conversion, code style linting); (3) you run 50K+ inference calls per month. At this volume, a fine-tuned small model typically matches or exceeds the quality of a prompted frontier model at 10-20x lower per-call cost.
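The three conditions above can be sketched as a simple check. This is a minimal illustration, not a definitive rule: the `Workload` class and `is_finetune_candidate` function are hypothetical names, and the thresholds are just the report's figures exposed as defaults so they can be adjusted.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a task, for illustration only."""
    labeled_examples: int   # labeled training examples available
    calls_per_month: int    # expected inference volume
    is_narrow_task: bool    # e.g. classification, extraction, format conversion


def is_finetune_candidate(
    w: Workload,
    min_examples: int = 500,      # threshold taken from the report
    min_calls: int = 50_000,      # threshold taken from the report
) -> bool:
    """Return True only when all three conditions from the report hold."""
    return (
        w.is_narrow_task
        and w.labeled_examples >= min_examples
        and w.calls_per_month >= min_calls
    )


ticket_routing = Workload(labeled_examples=1_200, calls_per_month=80_000, is_narrow_task=True)
open_ended_qa = Workload(labeled_examples=5_000, calls_per_month=200_000, is_narrow_task=False)

print(is_finetune_candidate(ticket_routing))
print(is_finetune_candidate(open_ended_qa))
```

The narrowness flag is a judgment call that the code cannot make for you; the numeric thresholds are only a first filter.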
Journey Context:
Fine-tuning shifts cost from recurring inference to one-time training. A fine-tuned small model at ~$0.15/1M input tokens versus a frontier model at ~$3/1M input tokens is a ~20x difference in per-token price. The quality crossover happens because fine-tuning internalizes the task pattern: the model no longer needs lengthy instructions and examples in every prompt, which also reduces input token count by 50-80%.

The failure modes: (1) fine-tuning for tasks that require broad reasoning, where the model memorizes patterns but cannot generalize beyond its training distribution; (2) fine-tuning on stale data, where the model degrades as your task distribution shifts and has to be retrained; (3) fine-tuning with too few examples, where under 100 examples often produces worse results than good prompting with examples.

Rule of thumb: if your system prompt is over 2K tokens of task-specific instructions and few-shot examples, and the task is repetitive, you're a fine-tuning candidate. The hidden savings: fine-tuned models need shorter prompts, so the lower per-token price and the lower token count compound.
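A rough worked example of how the two savings compound, using the report's per-token prices. Everything else is an assumption made for illustration: 50K calls per month, a 2,500-token prompt for the frontier model, a 70% prompt reduction after fine-tuning (inside the report's 50-80% range), and a placeholder one-time training cost. Output tokens are ignored, and the function names are hypothetical.

```python
def monthly_input_cost(calls: int, tokens_per_call: int, usd_per_million: float) -> float:
    """Monthly input-token spend in USD (output tokens are ignored here)."""
    return calls * tokens_per_call * usd_per_million / 1_000_000


def breakeven_months(training_cost_usd: float, monthly_savings_usd: float) -> float:
    """Months of savings needed to recover a one-time training cost."""
    if monthly_savings_usd <= 0:
        return float("inf")  # fine-tuning never pays for itself
    return training_cost_usd / monthly_savings_usd


CALLS = 50_000                 # assumed monthly volume
FRONTIER_PROMPT_TOKENS = 2_500  # assumed: long instructions plus few-shot examples
TUNED_PROMPT_TOKENS = 750       # assumed: 70% shorter prompt after fine-tuning
TRAINING_COST_USD = 100.0       # placeholder, not a quoted price

frontier = monthly_input_cost(CALLS, FRONTIER_PROMPT_TOKENS, 3.00)
tuned = monthly_input_cost(CALLS, TUNED_PROMPT_TOKENS, 0.15)
savings = frontier - tuned

print(f"frontier: ${frontier:.2f}/month")
print(f"fine-tuned small: ${tuned:.2f}/month")
print(f"cost ratio: {frontier / tuned:.1f}x")
print(f"break-even: {breakeven_months(TRAINING_COST_USD, savings):.2f} months")
```

Under these assumptions the frontier model costs about $375 per month in input tokens against under $6 for the fine-tuned model, so the combined ratio is well above the 20x price gap alone. Substitute your own volume, prompt lengths, and current provider pricing before drawing conclusions.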
Lifecycle
2026-06-17T17:13:21.816746+00:00 - report_created - created