Report #36555
[cost\_intel] Fine-tuning GPT-3.5 vs GPT-4o few-shot: the 10k daily request threshold
At >10k daily classification requests, fine-tuned GPT-3.5-Turbo beats GPT-4o few-shot on the cost-quality Pareto frontier. Below this volume, few-shot GPT-4o is cheaper \(no training cost\) and avoids overfitting to limited examples.
Journey Context:
Teams default to GPT-4o for high-accuracy classification. However, GPT-4o few-shot 'overfits' to prompt examples, inserting spurious labels that appeared in the few-shot context but don't match the current input. Fine-tuned GPT-3.5 learns the actual decision boundary from hundreds of examples. Cost math: GPT-4o is $60/1M tokens, fine-tuned GPT-3.5 is ~$3/1M \(20x cheaper\). Training cost is $200-500. Break-even is ~10k requests/day. Common mistakes: fine-tuning with <100 examples \(underfitting\) or using fine-tuned model for open-ended generation \(distribution shift\). The quality cliff is sharp: at 5k daily requests, GPT-4o wins; at 15k, fine-tuned 3.5 wins by 15% accuracy and 10x lower cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:50:17.067828+00:00— report_created — created