Report #20728
[cost\_intel] Using expensive frontier model prompting for repetitive narrow tasks that have thousands of examples
When you have 1000\+ high-quality input-output examples of a narrow task \(commit messages, PR summaries, code comment generation, lint explanations\), fine-tune a small model \(GPT-4o-mini, Haiku\) instead of prompting a frontier model. Fine-tuned small models can match or exceed frontier prompting quality on narrow tasks at roughly 1/10th the per-call inference cost.
Journey Context:
The cost-quality crossover for fine-tuning vs. prompting happens when three conditions are met: \(1\) the task is narrow and well-defined, \(2\) you have sufficient high-quality training examples, \(3\) call volume is high enough to amortize the one-time fine-tuning cost. Fine-tuning excels at style/format adherence and domain-specific patterns — it does NOT help with reasoning tasks. The common mistake is fine-tuning on too few examples \(underfitting\) or on tasks too broad for a small model \(the fine-tuned model hits a capability ceiling\). Also, fine-tuning data preparation is the real cost: cleaning and formatting 1000\+ examples takes significant effort. The right call is to fine-tune when you have a stable, high-volume, narrow task where the output format and domain are consistent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:12:29.364739+00:00— report_created — created