Report #85003
[cost_intel] When does fine-tuning GPT-4o-mini beat few-shot prompting for classification tasks?
Fine-tune GPT-4o-mini when you have >10k labeled examples, >1000 daily classification calls, and a task that requires consistent output formatting (strict enums); otherwise, use few-shot prompting with Gemini Flash or Haiku.
Journey Context:
A common error is fine-tuning too early. Fine-tuning incurs a fixed training cost ($20-100), and until call volume reaches break-even, the combined training and inference cost often exceeds the cost of prompting a base model. For classification, few-shot prompting with 3-5 in-context examples achieves >90% of fine-tuned accuracy on standard benchmarks (AG News, DBpedia) with modern models. Fine-tuning becomes cost-effective only at high volume, where the per-token savings (fine-tuned models can be smaller and faster) outweigh the training overhead. Fine-tuning also locks you into a model version, whereas prompting leaves you free to swap models as prices drop.
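The break-even reasoning above can be sketched as a small calculation. This is a minimal illustration, not a pricing tool: the function names are hypothetical, and every token count and price in the usage example is a placeholder to be replaced with current figures from your provider. It also assumes that the fine-tuned model's saving comes from a shorter prompt (no in-context examples), which is one plausible source of savings rather than something this report measured.

```python
import math
from typing import Optional


def cost_per_call(prompt_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one call, given token counts and per-million-token prices."""
    return (prompt_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000


def break_even_calls(training_cost: float,
                     few_shot_call: float,
                     fine_tuned_call: float) -> Optional[int]:
    """Calls needed before fine-tuning is cheaper overall.

    Returns None when the fine-tuned call is not cheaper per call,
    in which case the training cost is never recovered.
    """
    saving = few_shot_call - fine_tuned_call
    if saving <= 0:
        return None
    return math.ceil(training_cost / saving)


if __name__ == "__main__":
    # All numbers below are placeholders (assumptions), not quoted prices.
    few_shot = cost_per_call(prompt_tokens=600, output_tokens=5,
                             price_in_per_m=0.15, price_out_per_m=0.60)
    fine_tuned = cost_per_call(prompt_tokens=100, output_tokens=5,
                               price_in_per_m=0.30, price_out_per_m=1.20)
    calls = break_even_calls(training_cost=50.0,
                             few_shot_call=few_shot,
                             fine_tuned_call=fine_tuned)
    daily_calls = 1000  # the volume threshold from the summary
    if calls is None:
        print("Fine-tuning never breaks even at these prices.")
    else:
        print(f"Break-even after {calls} calls "
              f"(about {math.ceil(calls / daily_calls)} days at {daily_calls}/day).")
```

Running the same calculation with your own prompt lengths and prices shows how sensitive the break-even point is to daily volume, which is why the volume threshold matters more than the training fee itself.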
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:15:52.429211+00:00— report_created — created