Report #97140
[cost_intel] Premature fine-tuning of smaller models when dynamic few-shot prompting with frontier models is cheaper
Calculate the break-even before fine-tuning: fine-tuning beats dynamic few-shot prompting only when monthly volume exceeds 10M tokens on a single narrow task AND the few-shot context exceeds 2k tokens per request. If either condition fails, use GPT-4o-mini or Haiku with 3-5 retrieved examples (RAG) instead of fine-tuning.
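A minimal sketch of the dynamic few-shot selection recommended above, assuming a small pool of labeled examples. The bag-of-words cosine similarity is only a stand-in for a real embedding model, and `select_examples`, `build_prompt`, and `EXAMPLE_POOL` are hypothetical names used for illustration, not part of any library.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query: str, pool: list[dict], k: int = 4) -> list[dict]:
    """Return the k pool entries whose input is most similar to the query."""
    q = _vectorize(query)
    ranked = sorted(
        pool, key=lambda ex: _cosine(q, _vectorize(ex["input"])), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, pool: list[dict], k: int = 4) -> str:
    """Assemble a few-shot prompt: retrieved examples first, then the query."""
    shots = select_examples(query, pool, k)
    blocks = [f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in shots]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical example pool for a ticket-classification task.
EXAMPLE_POOL = [
    {"input": "Refund not received after 10 days", "output": "billing"},
    {"input": "App crashes when opening settings", "output": "bug"},
    {"input": "How do I export my data to CSV?", "output": "how_to"},
    {"input": "Charged twice for the same order", "output": "billing"},
    {"input": "Login page shows blank screen", "output": "bug"},
]

prompt = build_prompt("I was charged twice and want a refund", EXAMPLE_POOL, k=2)
print(prompt)
```

The resulting `prompt` string would be sent to the base model as-is. Swapping `_vectorize` and `_cosine` for an embedding lookup leaves the rest unchanged, which is what keeps this approach fast to iterate on compared with retraining.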
Journey Context:
Teams fine-tune GPT-3.5 or Haiku to save money, but doing so incurs $200-500 in training costs plus ongoing maintenance debt. For tasks with fewer than 10k requests/month, the savings don't amortize.

Example: a task requires 4k tokens of few-shot examples to reach 95% accuracy, and a fine-tuned model removes that 4k overhead. Prices per call:

Fine-tuned 3.5 Turbo: $3/1M input + $6/1M output.
Base 4o-mini: $0.15/1M input + $0.60/1M output, plus the 4k-token few-shot overhead on every request.

At 10k requests/month with 2k input and 500 output tokens per request:

Fine-tuned: 10k * (2k * $3/1M + 500 * $6/1M) = $60 + $30 = $90.
Few-shot with 4k extra tokens: 10k * (6k * $0.15/1M + 500 * $0.60/1M) = $9 + $3 = $12.

Few-shot is cheaper and faster to iterate on. Fine-tuning only wins when volume is massive or the task is so specific that even 10k examples in context don't help.
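The arithmetic above can be reproduced with a short script. This is a sketch under the prices quoted in this report, which may not match current list prices; `monthly_cost` and `break_even_months` are hypothetical helper names, and the $500 training cost is the upper end of the range given above.

```python
from typing import Optional


def monthly_cost(
    requests: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Monthly spend in dollars, given per-request token counts and $/1M prices."""
    input_cost = requests * input_tokens * input_price_per_m / 1_000_000
    output_cost = requests * output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


def break_even_months(
    training_cost: float, fine_tuned_monthly: float, few_shot_monthly: float
) -> Optional[float]:
    """Months until training cost is repaid; None if fine-tuning never saves money."""
    savings = few_shot_monthly - fine_tuned_monthly
    if savings <= 0:
        return None
    return training_cost / savings


REQUESTS = 10_000        # requests per month, from the example
BASE_INPUT = 2_000       # input tokens per request without few-shot examples
FEW_SHOT_EXTRA = 4_000   # few-shot tokens added to every request
OUTPUT = 500             # output tokens per request

# Prices as quoted in this report (assumption: check current provider pricing).
fine_tuned = monthly_cost(REQUESTS, BASE_INPUT, OUTPUT, 3.00, 6.00)
few_shot = monthly_cost(REQUESTS, BASE_INPUT + FEW_SHOT_EXTRA, OUTPUT, 0.15, 0.60)

print(f"fine-tuned: ${fine_tuned:.2f}/month")   # fine-tuned: $90.00/month
print(f"few-shot:   ${few_shot:.2f}/month")     # few-shot:   $12.00/month
print("break-even months:", break_even_months(500, fine_tuned, few_shot))
```

Under these particular prices, `break_even_months` returns `None`: both costs scale linearly with request count, so the fine-tuned route stays more expensive per call at any volume. Rerunning the script with the actual prices and token counts for a given task is the quickest way to check whether fine-tuning pays back its training cost.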
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:37:57.118204+00:00— report_created — created