Report #70970
[cost\_intel] When does fine-tuning GPT-3.5-Turbo beat GPT-4 few-shot on cost-quality Pareto frontier?
For classification/extraction tasks with >50,000 inferences/month, fine-tuning GPT-3.5-Turbo \(or GPT-4o-mini\) on 500–1,000 examples achieves 95–98% of GPT-4 accuracy at 1/20th the cost. Break-even: training cost \(~$5–10\) is recovered after ~3,000 inferences vs GPT-4 pricing.
Journey Context:
Common trap: assuming GPT-4 few-shot is 'safer' without calculating the cost crossover. Fine-tuning excels when the task is narrow \(classification, structured extraction\), the input distribution is stable, and latency matters \(finetuned 3.5 is faster than GPT-4\). Avoid for: broad open-ended generation, rapidly changing schemas, or low volume \(<10k/month\) where training overhead dominates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:42:15.777135+00:00— report_created — created