Report #85705
[cost_intel] When does fine-tuning GPT-3.5-turbo beat few-shot GPT-4 on cost-quality for classification?
Fine-tune GPT-3.5-turbo when you have >10k labeled examples and >1M monthly requests; below this scale, few-shot GPT-4 is cheaper and higher quality.
Journey Context:
Teams often assume fine-tuning is always better for domain tasks. It is not. Fine-tuning carries an upfront training cost ($20-100) and locks you into a specific model version. Fine-tuned GPT-3.5 costs $3/mtok input, which is higher than the base model's rate, but it avoids the much more expensive GPT-4 ($30/mtok). On quality, few-shot GPT-4 with 5 examples often hits 95% accuracy where fine-tuned 3.5 hits 92%. The crossover point is volume: at 1M requests/month, the $27/mtok input savings ($3 vs $30 per mtok, i.e. $0.003 vs $0.03 per 1k tokens) cover the training cost and can justify accepting the 3-point quality gap. Below that, the complexity of maintaining a fine-tuned model outweighs the savings.
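The cost side of this comparison can be sketched as a small calculator. This is a rough illustration, not a pricing tool: the $3 and $30 per-mtok input prices, the $100 training cost, and the 95%/92% accuracies come from the report above, while the token counts per request, the `Scenario` class, and all function names are hypothetical placeholders. It counts input tokens only and ignores output tokens (short for classification) and any engineering or maintenance cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Inputs for one cost comparison. Token counts are assumptions."""
    requests_per_month: int
    prompt_tokens: int                  # instructions + item to classify (assumed)
    few_shot_tokens: int                # extra in-context examples, GPT-4 path only (assumed)
    ft_price_per_mtok: float = 3.0      # fine-tuned GPT-3.5 input price from the report
    gpt4_price_per_mtok: float = 30.0   # GPT-4 input price from the report
    training_cost: float = 100.0        # upper end of the report's $20-100 range
    gpt4_accuracy: float = 0.95         # from the report
    ft_accuracy: float = 0.92           # from the report


def monthly_input_cost(requests: int, tokens_per_request: int,
                       price_per_mtok: float) -> float:
    """Input-token spend per month in dollars."""
    return requests * tokens_per_request * price_per_mtok / 1_000_000


def monthly_savings(s: Scenario) -> float:
    """Dollars saved per month by the fine-tuned path (input tokens only)."""
    gpt4 = monthly_input_cost(
        s.requests_per_month,
        s.prompt_tokens + s.few_shot_tokens,
        s.gpt4_price_per_mtok,
    )
    fine_tuned = monthly_input_cost(
        s.requests_per_month, s.prompt_tokens, s.ft_price_per_mtok
    )
    return gpt4 - fine_tuned


def months_to_recoup_training(s: Scenario) -> float:
    """Months until savings cover the one-off training cost."""
    savings = monthly_savings(s)
    if savings <= 0:
        return float("inf")  # fine-tuning never pays back
    return s.training_cost / savings


def extra_errors_per_month(s: Scenario) -> float:
    """Additional misclassifications accepted by choosing the fine-tuned model."""
    return s.requests_per_month * (s.gpt4_accuracy - s.ft_accuracy)


if __name__ == "__main__":
    # Hypothetical workload: 100-token prompt, 400 tokens of few-shot examples.
    scenario = Scenario(requests_per_month=1_000_000,
                        prompt_tokens=100, few_shot_tokens=400)
    print(f"monthly savings: ${monthly_savings(scenario):,.2f}")
    print(f"months to recoup training: {months_to_recoup_training(scenario):.4f}")
    print(f"extra errors per month: {extra_errors_per_month(scenario):,.0f}")
```

Reporting the extra misclassifications separately from the dollar savings keeps the quality gap visible rather than folding it into a single number. Putting a price on each error is a business decision that this sketch does not attempt.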
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:26:23.440909+00:00— report_created — created