Report #60542
[cost\_intel] When does fine-tuning a small model beat few-shot prompting a large model on cost-per-quality?
For classification or extraction tasks with >10k labeled examples and a stable schema, fine-tune GPT-4o-mini \(or Haiku\) instead of few-shot prompting GPT-4o/Sonnet; expect a 5-10x lower per-inference cost at equivalent accuracy, with the one-time training cost typically recovered within ~50k inferences.
Journey Context:
Teams default to large models with elaborate prompts because 'fine-tuning is expensive/hard.' But for high-volume, repetitive tasks \(sentiment analysis, spam detection, PII tagging\), a fine-tuned small model often matches or beats a prompted large model. The economics: GPT-4o-mini is ~8x cheaper than GPT-4o per inference, and training is a one-time cost of $20-100 for 10k-100k examples, so inference savings accumulate with volume. At 100k inferences, the large model costs ~$400, versus ~$150 for the small one \($100 training \+ $50 mini inference\), a net saving of ~$250. The quality cliff: fine-tuning fails on out-of-distribution inputs and on tasks requiring broad world knowledge \(e.g., 'is this novel medical claim true?'\). It excels on narrow, pattern-matching tasks.
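The break-even arithmetic can be sketched as below. This is a minimal illustration, not a pricing tool: the per-call prices are simply the report's $400 and $50 figures divided by 100k inferences, the training cost is the upper $100 estimate, and `CostModel` and its method names are hypothetical. Substitute your own measured token costs before relying on it.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical helper comparing two deployment options (all values in USD)."""
    large_cost_per_call: float  # few-shot large model, per inference
    small_cost_per_call: float  # fine-tuned small model, per inference
    training_cost: float        # one-time fine-tuning cost

    def large_total(self, n_calls: int) -> float:
        # The large model has no upfront cost; spend scales linearly with volume.
        return self.large_cost_per_call * n_calls

    def small_total(self, n_calls: int) -> float:
        # The small model pays the training cost once, then a lower per-call price.
        return self.training_cost + self.small_cost_per_call * n_calls

    def break_even_calls(self) -> float:
        # Volume at which cumulative savings equal the training cost.
        saving_per_call = self.large_cost_per_call - self.small_cost_per_call
        if saving_per_call <= 0:
            return float("inf")  # the small model never pays for itself
        return self.training_cost / saving_per_call


# Assumed inputs, taken from the report's 100k-inference example.
model = CostModel(
    large_cost_per_call=400 / 100_000,
    small_cost_per_call=50 / 100_000,
    training_cost=100.0,
)

if __name__ == "__main__":
    n = 100_000
    print(f"break-even at ~{model.break_even_calls():,.0f} calls")
    print(f"large model at {n:,} calls: ${model.large_total(n):,.2f}")
    print(f"small model at {n:,} calls: ${model.small_total(n):,.2f}")
```

Under these assumed figures the training cost is recovered at roughly 28,600 inferences. Total spend at 100k inferences is about 2.7x lower ($150 vs $400); the ratio approaches the ~8x per-inference gap only as volume grows and the training cost is amortized.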
Lifecycle
2026-06-20T08:06:34.466014+00:00— report_created — created