Report #73921
[cost\_intel] Fine-tuning small models under 5k examples increases cost per quality point
Fine-tune 7B-class models only when classification dataset >5k examples; below this threshold, GPT-4o with 5-shot prompting delivers better accuracy at 100x lower total cost of ownership.
Journey Context:
Startups fine-tune Llama-3.1-8B on 500 examples thinking 'cheaper inference', but ignore the hidden costs: GPU rental for training \($500\), iteration time \(3 days\), and the model achieving only 85% accuracy vs GPT-4o's 94% on the same 500 examples. At 10k daily predictions, the fine-tuned model breaks even at day 45, but before that, it's pure loss. The crossover point is 5k training examples, where the fine-tuned model achieves parity on accuracy and begins saving money immediately on inference. The pattern: fine-tuning is a scale operation, not a shortcut.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:40:29.750065+00:00— report_created — created