Report #46854
[cost_intel] When does fine-tuning GPT-3.5-turbo beat GPT-4 prompting on cost-per-quality for domain tasks?
Fine-tune GPT-3.5-turbo when you have >1,000 high-quality labeled examples, the task is stylistically consistent (e.g., a specific JSON dialect or brand voice), and the distribution is stable (quarterly drift <15%). Under those conditions, a fine-tuned 3.5-turbo achieves 90% of GPT-4 quality at 1/20th the inference cost ($0.003 vs $0.06 per 1K tokens). Do not fine-tune for tasks that require broad world-knowledge updates or face rapid distribution shifts: the static training set becomes a liability within weeks.
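The three criteria above can be written down as a simple gate. This is a minimal sketch, not a definitive rule: the `TaskProfile` type and its field names are hypothetical, how you measure "quarterly drift" is left to you, and the thresholds are taken directly from the summary.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a candidate task (field names are illustrative)."""
    labeled_examples: int              # high-quality labeled examples available
    stylistically_consistent: bool     # e.g. fixed JSON dialect or brand voice
    quarterly_drift: float             # fraction of the distribution that shifts per quarter
    needs_fresh_world_knowledge: bool  # task depends on broad, changing world knowledge


def should_fine_tune(task: TaskProfile) -> bool:
    """Apply the thresholds from the summary: >1,000 examples, consistent style, drift <15%."""
    if task.needs_fresh_world_knowledge:
        # A static training set cannot keep up with knowledge updates.
        return False
    return (
        task.labeled_examples > 1000
        and task.stylistically_consistent
        and task.quarterly_drift < 0.15
    )


if __name__ == "__main__":
    legal_extraction = TaskProfile(5000, True, 0.05, False)
    social_trends = TaskProfile(5000, True, 0.40, False)
    print(should_fine_tune(legal_extraction), should_fine_tune(social_trends))
```

A gate like this only screens candidates; whether fine-tuning actually pays off still depends on request volume and retraining cost.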
Journey Context:
Teams assume fine-tuning is for 'accuracy', but it is actually for 'style and format adherence'. GPT-4 is a generalist; a fine-tuned 3.5-turbo is a specialist. The cost-quality curves cross when the 'format compliance' tax on GPT-4 exceeds the 'capability gap' tax of 3.5-turbo. Example: extracting specific fields from legal documents. GPT-4 might reach 98% accuracy but require complex prompt engineering and retry logic. A fine-tuned 3.5-turbo reaches 95% accuracy with zero-shot reliability.

The hard-won insight is the 'distribution stability' requirement. If your data changes monthly (e.g., parsing social media trends), fine-tuning is a treadmill: you are constantly retraining. The break-even is 6 months of stable distribution.

There is also a hidden cost: fine-tuning requires 10x the training data in validation/testing to avoid overfitting. So 1,000 examples is the floor, but 5,000 is the practical minimum for robustness.
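The break-even reasoning above can be made concrete with a small cost-per-correct-output calculation. This is a rough sketch under stated assumptions: the per-1K-token prices and accuracies are the figures quoted in this report, while `tokens_per_request`, `avg_attempts` (GPT-4's retry overhead), and `fixed_cost` (training plus data preparation for one fine-tuning cycle) are hypothetical placeholders you would replace with your own measurements.

```python
def cost_per_correct(price_per_1k: float, tokens_per_request: int,
                     accuracy: float, avg_attempts: float = 1.0) -> float:
    """Expected USD spent per correct output.

    avg_attempts models retry logic (1.0 means no retries);
    accuracy is the fraction of requests that end up correct.
    """
    cost_per_request = price_per_1k * tokens_per_request / 1000 * avg_attempts
    return cost_per_request / accuracy


def break_even_outputs(fixed_cost: float, gpt4_unit: float, ft_unit: float) -> float:
    """Correct outputs needed before fine-tuning's fixed cost is repaid."""
    saving = gpt4_unit - ft_unit
    if saving <= 0:
        # Fine-tuned model is not cheaper per correct output: never breaks even.
        return float("inf")
    return fixed_cost / saving


if __name__ == "__main__":
    # Prices and accuracies from the report; the rest are hypothetical inputs.
    gpt4 = cost_per_correct(0.06, tokens_per_request=1000, accuracy=0.98, avg_attempts=1.2)
    ft35 = cost_per_correct(0.003, tokens_per_request=1000, accuracy=0.95)
    n = break_even_outputs(fixed_cost=500.0, gpt4_unit=gpt4, ft_unit=ft35)
    print(f"GPT-4: ${gpt4:.4f}, fine-tuned 3.5: ${ft35:.4f}, break-even after {n:.0f} outputs")
```

If the distribution shifts before that many outputs have been served, the fixed cost is paid again on each retraining cycle, which is the "treadmill" described above.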
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:07:05.736945+00:00— report_created — created