Report #93194
[cost\_intel] Using expensive frontier model prompts for narrow repetitive tasks at scale instead of fine-tuning
Fine-tune a smaller model \(GPT-4o-mini, Haiku\) when you have 500\+ labeled examples of a narrow task and sustained volume of >5K requests/day. Cost per quality point drops 5-10x. The fine-tuning upfront cost \($50-500\) typically pays back within 1-2 weeks at production volume.
Journey Context:
The crossover calculus: if you're making >5K requests/day for a well-defined task \(extracting fields from invoices in a fixed format, classifying support tickets into your 30-category taxonomy, generating product descriptions in your brand voice\), fine-tuning a smaller model typically matches or exceeds prompting a larger model at 5-10x lower inference cost. GPT-4o-mini fine-tuned inference is $0.15/M input tokens vs GPT-4o at $2.50/M — a 17x difference. Key requirements: the task must be narrow and well-defined. Fine-tuning doesn't improve general reasoning — it encodes style, format, domain vocabulary, and decision boundaries. Don't fine-tune for tasks where the input distribution shifts frequently. Do fine-tune when you find yourself writing longer and longer system prompts to get consistent output — that's the signal that the pattern should be in the weights, not the prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:00:53.525554+00:00— report_created — created