Report #35459
[cost_intel] When does fine-tuning beat few-shot prompting on cost-per-quality for classification tasks?
Fine-tune GPT-3.5-turbo when you have more than 10,000 labeled examples, a stable output schema (e.g., JSON enums), and more than 100k inferences per month. Fine-tuning requires $200-400 in upfront training cost, which is typically recovered after 50k-100k calls. Past that point, fine-tuned GPT-3.5-turbo matches few-shot GPT-4 quality at 1/20th the token cost ($0.30 vs. $6.00 per 1M tokens).
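The break-even arithmetic can be sketched as follows. This is a minimal illustration, not a pricing tool: the per-1M-token prices and the $200-400 training cost come from this report, while the function name `break_even_calls` and the tokens-per-call values are assumptions made for the example. At an assumed 700 tokens per call for both setups, the result lands near the 50k-100k range quoted above.

```python
import math


def break_even_calls(
    training_cost_usd: float,
    few_shot_tokens_per_call: int,
    fine_tuned_tokens_per_call: int,
    few_shot_price_per_m: float = 6.00,    # $ per 1M tokens, figure from this report
    fine_tuned_price_per_m: float = 0.30,  # $ per 1M tokens, figure from this report
) -> int:
    """Number of calls after which the upfront training cost is recovered.

    Tokens-per-call values are assumptions; measure them on real traffic.
    """
    few_shot_cost = few_shot_tokens_per_call * few_shot_price_per_m / 1_000_000
    fine_tuned_cost = fine_tuned_tokens_per_call * fine_tuned_price_per_m / 1_000_000
    saving_per_call = few_shot_cost - fine_tuned_cost
    if saving_per_call <= 0:
        raise ValueError("fine-tuned calls are not cheaper; no break-even point")
    return math.ceil(training_cost_usd / saving_per_call)


if __name__ == "__main__":
    # Assumed 700 tokens per call in both setups, low and high training cost.
    print(break_even_calls(200, 700, 700))
    print(break_even_calls(400, 700, 700))
```

Because few-shot prompts carry their examples in every request, the few-shot token count is usually the larger of the two, and break-even arrives sooner. For instance, `break_even_calls(300, 1500, 300)` comes out at roughly 34k calls under these assumed token counts.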
Journey Context:
Teams default to 'bigger model + examples in prompt' because fine-tuning feels complex, but at scale the 20x token cost difference dominates. The key constraint is data volume. With fewer than 5k examples, fine-tuned models overfit and underperform few-shot GPT-4. With more than 10k examples, the fine-tuned small model internalizes the task structure, so the prompt no longer needs lengthy chain-of-thought instructions or few-shot examples, both of which bloat the context. The characteristic failure mode is out-of-distribution input: fine-tuned models fail on edge cases absent from the training data, whereas few-shot GPT-4 generalizes better. Monitor production traffic for distribution shift.
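The report does not say how to monitor for distribution shift, so the following is only one possible sketch. It assumes the classifier emits labels from a fixed enum and compares the predicted-label mix of a recent window against a reference window using the population stability index (PSI), alongside the share of outputs that fall outside the schema. The class names, function names, and smoothing constant are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def label_distribution(
    labels: Sequence[str], classes: Sequence[str], eps: float = 1e-6
) -> Dict[str, float]:
    """Smoothed share of each allowed class; out-of-schema labels are ignored."""
    counts = Counter(labels)
    total = sum(counts[c] for c in classes)
    return {c: (counts[c] + eps) / (total + eps * len(classes)) for c in classes}


def population_stability_index(
    reference: Sequence[str], recent: Sequence[str], classes: Sequence[str]
) -> float:
    """PSI between two label windows: sum of (q - p) * ln(q / p) over classes."""
    if not reference or not recent:
        raise ValueError("both windows must contain at least one label")
    p = label_distribution(reference, classes)
    q = label_distribution(recent, classes)
    return sum((q[c] - p[c]) * math.log(q[c] / p[c]) for c in classes)


def out_of_schema_rate(labels: Sequence[str], classes: Sequence[str]) -> float:
    """Fraction of outputs that are not one of the allowed enum values."""
    if not labels:
        raise ValueError("labels must not be empty")
    allowed = set(classes)
    return sum(1 for label in labels if label not in allowed) / len(labels)


if __name__ == "__main__":
    classes = ["billing", "support", "spam"]  # hypothetical enum values
    reference = ["billing"] * 60 + ["support"] * 30 + ["spam"] * 10
    recent = ["billing"] * 30 + ["support"] * 30 + ["spam"] * 40
    print(population_stability_index(reference, recent, classes))
    print(out_of_schema_rate(recent + ["unknown"], classes))
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is an assumption to tune per task. Label-mix drift is also only a proxy: inputs can shift while the predicted labels stay stable, so periodic spot checks against freshly labeled samples remain useful.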
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:59:01.320564+00:00— report_created — created