Report #92929
[cost_intel] When does fine-tuning GPT-3.5 or GPT-4o-mini beat few-shot prompting with larger models on cost per quality?
Fine-tuning GPT-3.5 Turbo or GPT-4o-mini becomes cost-efficient above roughly 100k requests/month when the task requires strict output-format adherence (e.g., strict JSON schemas) or style mimicry; at 1M requests/month, fine-tuned small models deliver 10x lower cost per quality point than zero-shot GPT-4o.
Journey Context:
Teams often fine-tune to 'save money' while chasing accuracy, but if you only need classification or extraction, few-shot prompting with Haiku or GPT-4o-mini is cheaper and faster to iterate on. Fine-tuning wins when you have high volume AND the failure mode is format adherence or tone, not reasoning. Example: generating legal summaries in a very specific structured format. GPT-4o might get the format wrong 5% of the time; a fine-tuned GPT-3.5 gets it right 99% of the time at 1/10th the cost. The hidden cost is the training data: if you need >10k examples, the labeling cost may swamp the inference savings.
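The trade-off above can be sketched as a break-even calculation. This is a rough illustration, not a pricing tool: the per-request prices, the $1 per labeled example, and the $100 training fee are placeholder assumptions, and the names `Option`, `cost_per_valid_output`, and `break_even_months` are hypothetical. Only the 95% vs. 99% format success rates, the 1/10th cost ratio, and the 10k example count come from the report. It also assumes a badly formatted output is simply retried until one passes validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Option:
    """One way of serving the task. All dollar figures are placeholders."""
    cost_per_request: float      # USD per request (prompt + completion tokens)
    format_success_rate: float   # fraction of outputs that pass format validation
    upfront_cost: float = 0.0    # USD for labeling + training; zero for prompting


def cost_per_valid_output(opt: Option) -> float:
    """Expected spend per output that passes validation, assuming failed
    outputs are retried independently until one passes."""
    return opt.cost_per_request / opt.format_success_rate


def break_even_months(baseline: Option, fine_tuned: Option,
                      requests_per_month: int) -> Optional[float]:
    """Months until the fine-tuned option's extra upfront cost is repaid by
    inference savings. Returns None if it never pays back."""
    saving = cost_per_valid_output(baseline) - cost_per_valid_output(fine_tuned)
    monthly_saving = saving * requests_per_month
    if monthly_saving <= 0:
        return None
    extra_upfront = max(fine_tuned.upfront_cost - baseline.upfront_cost, 0.0)
    return extra_upfront / monthly_saving


# Placeholder inputs: $0.01 per large-model request is an assumption; the
# 5% format failure rate, 99% success rate, and 1/10th cost are from the report.
baseline = Option(cost_per_request=0.01, format_success_rate=0.95)
fine_tuned = Option(
    cost_per_request=0.001,
    format_success_rate=0.99,
    upfront_cost=10_000 * 1.00 + 100.0,  # 10k examples at an assumed $1 each + assumed training fee
)

for volume in (100_000, 1_000_000):
    months = break_even_months(baseline, fine_tuned, volume)
    print(f"{volume:>9,} requests/month -> break-even in {months:.1f} months")
```

With these placeholder inputs the labeling bill takes roughly ten months to repay at 100k requests/month and about one month at 1M, which is the sense in which labeling cost can swamp the savings at lower volume. Substitute your own measured prices and failure rates before drawing conclusions.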
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:34:00.852054+00:00— report_created — created