Report #49960
[cost\_intel] Wasting money on fine-tuning with insufficient data or wrong task type
Fine-tune only when you have more than 1000 high-quality examples AND the task either benefits from style/tone consistency \(e.g., SQL generation against company-specific schemas\) or requires <50ms latency. With fewer than 1000 examples, or for tasks that need broad world knowledge \(trivia, general reasoning\), use RAG with few-shot prompting instead; fine-tuning is likely to overfit and to cost more per inference with no quality gain.
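The rule of thumb above can be sketched as a small decision helper. This is only an illustration: `TaskProfile`, its field names, and `recommend` are hypothetical and not part of any library. It also assumes the rule reads as "enough data AND (style consistency OR tight latency)", and it treats exactly 1000 examples as not enough, since the report leaves that boundary undefined.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; field names are illustrative."""
    num_examples: int              # count of high-quality labeled examples
    needs_style_consistency: bool  # e.g. SQL against company-specific schemas
    latency_budget_ms: float       # required response latency in milliseconds
    needs_world_knowledge: bool    # e.g. trivia, general reasoning


MIN_EXAMPLES = 1000        # empirical threshold quoted in the report
LATENCY_CUTOFF_MS = 50.0   # latency bound quoted in the report


def recommend(task: TaskProfile) -> str:
    """Return "fine-tune" or "rag+few-shot" following the report's rule of thumb."""
    # Too little data, or a task that leans on broad world knowledge:
    # the report recommends RAG with few-shot prompting.
    # Assumption: exactly MIN_EXAMPLES counts as "not enough".
    if task.num_examples <= MIN_EXAMPLES or task.needs_world_knowledge:
        return "rag+few-shot"
    # Enough data plus at least one of the two qualifying conditions.
    if task.needs_style_consistency or task.latency_budget_ms < LATENCY_CUTOFF_MS:
        return "fine-tune"
    # Enough data but no qualifying condition: default to the cheaper option.
    return "rag+few-shot"
```

For example, `recommend(TaskProfile(5000, True, 200.0, False))` returns `"fine-tune"`, while the same task with only 200 examples returns `"rag+few-shot"`.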
Journey Context:
The hype cycle pushes fine-tuning as a universal quality booster. In practice, for tasks like classification or extraction, a fine-tuned small model often loses to a prompted frontier model \+ RAG. Fine-tuning pays off only when \(1\) you have enough data to outweigh the base model's prior knowledge, \(2\) you need to cut output token count \(the model answers from learned weights instead of generating its reasoning\), or \(3\) you need to enforce strict output formats more cheaply than constrained decoding would. The 1000-example threshold is empirical: below it, improvements on the training set tend not to carry over to held-out data.
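To make the cost side of this concrete, here is a rough break-even sketch. It assumes a deliberately simple cost model: a one-off training cost and a flat cost per request for each option. The function name and all three parameters are hypothetical, and the model ignores hosting, data labeling, and retraining, so treat the result as a lower bound on the volume needed.

```python
def break_even_requests(training_cost: float,
                        prompted_cost_per_request: float,
                        finetuned_cost_per_request: float) -> float:
    """Requests needed before fine-tuning pays back its one-off training cost.

    All inputs are in the same currency unit and must be supplied by the
    caller; none of them come from the report.
    Returns float("inf") when the fine-tuned model is not cheaper per request,
    which is the failure case the report warns about.
    """
    saving = prompted_cost_per_request - finetuned_cost_per_request
    if saving <= 0:
        # No per-request saving: the training cost is never recovered.
        return float("inf")
    return training_cost / saving
```

If the expected request volume over the model's lifetime is well below the returned figure, the prompted model with RAG is the cheaper choice under these assumptions.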
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:20:28.997171+00:00— report_created — created