Report #30149
[cost_intel] When does fine-tuning GPT-3.5 Turbo beat few-shot GPT-4 for classification tasks at >1M requests/month?
Fine-tune when you have more than 500 high-quality labeled examples, the task is classification or a simple transformation (input -> output), and you need more than 100 RPM sustained. A fine-tuned gpt-3.5-turbo-0125 reduces latency by 40% and input cost by 10x ($0.50 vs. $5 per 1M input tokens for GPT-4), and it exceeds GPT-4 few-shot accuracy once it has been trained on about 1,000 examples.
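The rule of thumb above can be written down as a small check. This is only a sketch: the function names, the task labels, and the 30-day month are assumptions made for illustration, and the thresholds are the ones quoted in this report, not official guidance.

```python
# Hypothetical helper names; thresholds are the ones quoted in this report.
SIMPLE_TASKS = {"classification", "transformation"}


def should_fine_tune(num_examples: int, task_kind: str, sustained_rpm: float) -> bool:
    """Return True when all three conditions from the report hold."""
    return (
        num_examples > 500              # enough high-quality examples
        and task_kind in SIMPLE_TASKS   # classification or simple input -> output
        and sustained_rpm > 100         # sustained load, not an occasional burst
    )


def rpm_to_monthly_requests(rpm: float, days_per_month: int = 30) -> int:
    """Convert a sustained requests-per-minute rate to requests per month."""
    return int(rpm * 60 * 24 * days_per_month)


if __name__ == "__main__":
    print(should_fine_tune(1000, "classification", 150))
    print(should_fine_tune(80, "classification", 150))
    print(rpm_to_monthly_requests(100))
```

Note that 100 RPM sustained works out to about 4.3M requests in a 30-day month, so any workload that meets the RPM condition is already well past the 1M requests/month mark in the question.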
Journey Context:
A common mistake is fine-tuning too early, with fewer than 100 examples, which leads to overfitting and worse performance than prompting. People also forget that fine-tuning doesn't teach the model new knowledge; it only shapes format and steering. The break-even analysis: GPT-4 few-shot pays for the example tokens in the prompt on every request, while a fine-tuned model bakes those examples into its weights. At high volume, the upfront training cost ($0.80/1K tokens trained) pays back in days.
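The break-even reasoning above can be worked through numerically. The sketch below uses the prices quoted in this report as defaults. The token counts, training-set size, and epoch count are hypothetical values chosen for illustration, and output tokens are ignored to keep the arithmetic simple. Substitute your own measurements and current pricing before relying on the result.

```python
import math
from dataclasses import dataclass


@dataclass
class Scenario:
    requests_per_month: int
    input_tokens: int       # tokens in the item being classified (hypothetical)
    few_shot_tokens: int    # example tokens prepended to every request (hypothetical)
    training_tokens: int    # total tokens billed for training (hypothetical)
    # Prices below are the figures quoted in this report, in USD.
    few_shot_price_per_1m: float = 5.00
    fine_tuned_price_per_1m: float = 0.50
    training_price_per_1k: float = 0.80


def monthly_few_shot_cost(s: Scenario) -> float:
    # Few-shot pays for the examples plus the input on every request.
    tokens = s.requests_per_month * (s.input_tokens + s.few_shot_tokens)
    return tokens * s.few_shot_price_per_1m / 1_000_000


def monthly_fine_tuned_cost(s: Scenario) -> float:
    # The fine-tuned model only sees the input; the examples are baked in.
    tokens = s.requests_per_month * s.input_tokens
    return tokens * s.fine_tuned_price_per_1m / 1_000_000


def training_cost(s: Scenario) -> float:
    return s.training_tokens * s.training_price_per_1k / 1_000


def break_even_days(s: Scenario, days_per_month: int = 30) -> float:
    daily_saving = (monthly_few_shot_cost(s) - monthly_fine_tuned_cost(s)) / days_per_month
    if daily_saving <= 0:
        return math.inf  # fine-tuning never pays back under these inputs
    return training_cost(s) / daily_saving


if __name__ == "__main__":
    # Hypothetical workload: 1M requests/month, 200-token inputs, 800 tokens of
    # few-shot examples, and 1,000 training examples x 250 tokens x 3 epochs.
    scenario = Scenario(
        requests_per_month=1_000_000,
        input_tokens=200,
        few_shot_tokens=800,
        training_tokens=1_000 * 250 * 3,
    )
    print(f"few-shot:   ${monthly_few_shot_cost(scenario):,.2f}/month")
    print(f"fine-tuned: ${monthly_fine_tuned_cost(scenario):,.2f}/month")
    print(f"training:   ${training_cost(scenario):,.2f} one-time")
    print(f"break-even: {break_even_days(scenario):.1f} days")
```

Under these assumed inputs the few-shot bill is about $5,000/month against about $100/month for the fine-tuned model, so a one-time training cost of about $600 is recovered in under four days. The result is most sensitive to `few_shot_tokens`: the shorter the few-shot prompt, the longer the payback.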
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:59:38.934110+00:00— report_created — created