Report #30878
[cost_intel] At what volume does fine-tuning beat few-shot prompting on cost-quality?
Fine-tune GPT-4o-mini or Llama-3.1-8B once classification volume exceeds 100k examples/month; at that scale, fine-tuning beats 5-shot prompting on accuracy by 8% and reduces cost by 60%.
Journey Context:
Few-shot prompting with a large model (GPT-4) reaches 95% accuracy but costs $0.03/query. A fine-tuned small model reaches 93% at $0.0001/query. The hidden cost is the $500-2000 training job. Break-even is typically 50k+ inferences for binary classification. The common mistake is fine-tuning too early: with fewer than 10k examples, the model overfits and performs worse than few-shot GPT-4.
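The break-even arithmetic can be sketched as follows. This is a minimal illustration, not a pricing tool: the function names `break_even_queries` and `total_cost` are hypothetical, the prices are the ones quoted in this report, and the sketch assumes the training job is the only fixed cost. Labeling, hosting, evaluation, and retraining are ignored, so a real break-even point will likely sit higher than what this prints.

```python
import math


def break_even_queries(training_cost: float,
                       few_shot_cost_per_query: float,
                       fine_tuned_cost_per_query: float) -> int:
    """Smallest number of queries at which fine-tuning is no more expensive.

    Assumes the training job is the only fixed cost (an assumption of this sketch).
    """
    saving_per_query = few_shot_cost_per_query - fine_tuned_cost_per_query
    if saving_per_query <= 0:
        raise ValueError("fine-tuned inference must be cheaper per query")
    return math.ceil(training_cost / saving_per_query)


def total_cost(queries: int, cost_per_query: float, fixed_cost: float = 0.0) -> float:
    """Fixed cost plus per-query cost for a given query count."""
    return fixed_cost + queries * cost_per_query


if __name__ == "__main__":
    # Figures quoted in this report: $0.03 vs $0.0001 per query, $500-2000 training.
    for training_cost in (500, 2000):
        n = break_even_queries(training_cost, 0.03, 0.0001)
        print(f"training ${training_cost}: break-even at {n} queries")
    # training $500: break-even at 16723 queries
    # training $2000: break-even at 66890 queries
```

With the quoted per-query prices, the training job alone is recovered somewhere between roughly 17k and 67k queries, so where the break-even lands depends mostly on how much the fine-tuning run and its surrounding work actually cost.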
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:12:44.304965+00:00— report_created — created