Report #76431
[cost_intel] Fine-tuning GPT-3.5 underperforms GPT-4 zero-shot on small datasets
Fine-tune GPT-3.5 only for classification or extraction tasks where you have more than 5,000 high-quality examples and a stable label taxonomy. Below 5,000 examples, GPT-4 without fine-tuning (prompted zero-shot or with a few in-context examples) outperforms fine-tuned smaller models at a lower total cost of ownership.
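The rule of thumb above can be written down as a small decision helper. This is only a sketch: the 5,000-example threshold and the task types come from this report, while `TaskProfile`, `recommend_approach`, and the returned labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Threshold and task types are taken from the report; all names are hypothetical.
MIN_EXAMPLES_FOR_FINE_TUNING = 5_000
FINE_TUNABLE_TASKS = {"classification", "extraction"}


@dataclass
class TaskProfile:
    task_type: str         # e.g. "classification", "extraction", "summarization"
    n_examples: int        # number of high-quality labeled examples available
    stable_taxonomy: bool  # True if the label set is not expected to change


def recommend_approach(profile: TaskProfile) -> str:
    """Return "fine-tune-gpt-3.5" or "prompt-gpt-4" per the report's rule of thumb."""
    if (
        profile.task_type in FINE_TUNABLE_TASKS
        and profile.n_examples > MIN_EXAMPLES_FOR_FINE_TUNING
        and profile.stable_taxonomy
    ):
        return "fine-tune-gpt-3.5"
    # Any failed condition falls back to prompting the larger model.
    return "prompt-gpt-4"
```

For instance, `recommend_approach(TaskProfile("classification", 8_000, True))` selects fine-tuning, while the same task with 3,000 examples or an unstable taxonomy falls back to prompting. The threshold is the report's heuristic, not a measured boundary, so it is worth validating on a held-out set before committing.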
Journey Context:
Teams often assume fine-tuning always beats prompting. OpenAI's fine-tuning requires substantial data volume to overcome base model drift. With fewer than 5,000 examples, the fine-tuned model overfits or fails to capture edge cases, while GPT-4's reasoning generalizes. With more than 5,000 examples, fine-tuned GPT-3.5 reaches 95% of GPT-4's accuracy at 1/20th of the inference cost. Maintenance cost (retraining when data or labels drift) must be factored in; fine-tuning creates technical debt that prompting avoids.
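To make the cost-of-ownership argument concrete, here is a minimal sketch of a break-even calculation. Only the 1/20 inference cost ratio comes from this report; the function names and every dollar figure in the usage note are hypothetical placeholders, so substitute current pricing and your own retraining estimates.

```python
def total_cost(
    requests: int,
    cost_per_request: float,
    fixed_cost: float = 0.0,
    retrains: int = 0,
    cost_per_retrain: float = 0.0,
) -> float:
    """Total cost of ownership over a planning horizon, in the same currency unit."""
    return fixed_cost + retrains * cost_per_retrain + requests * cost_per_request


def break_even_requests(
    gpt4_cost_per_request: float,
    fine_tune_fixed_cost: float,
    retrains: int,
    cost_per_retrain: float,
    inference_ratio: float = 1 / 20,  # ratio reported above; verify for your workload
) -> float:
    """Request volume at which fine-tuned GPT-3.5 and prompted GPT-4 cost the same."""
    fine_tuned_cost_per_request = gpt4_cost_per_request * inference_ratio
    saving_per_request = gpt4_cost_per_request - fine_tuned_cost_per_request
    # Overhead covers the initial fine-tune (data labeling, training)
    # plus every retraining run triggered by drift within the horizon.
    overhead = fine_tune_fixed_cost + retrains * cost_per_retrain
    return overhead / saving_per_request
```

As a hypothetical usage example, `break_even_requests(0.02, 1100.0, 4, 200.0)` gives the number of requests after which the fine-tuned model becomes cheaper, assuming four retraining runs over the horizon. Below that volume, the fixed and maintenance costs outweigh the per-request saving, which is the technical-debt point made above. The sketch ignores the accuracy gap between the two options, which has its own cost.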
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:52:55.990662+00:00 - report_created - created