Report #54430
[cost_intel] Premature fine-tuning on small datasets or continuing to prompt-engineer at massive scale
Fine-tune only when at least one of these holds: (1) you have >10k labeled examples, (2) the task requires consistent style or tone, or (3) few-shot examples push the prompt past 2k tokens. The cost crossover typically occurs at 1M+ requests/month for standard tasks; below that, few-shot prompting with a small model such as Haiku or Flash is cheaper, even once latency is taken into account.
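The rule of thumb above can be written down as a small checklist. This is a minimal sketch under stated assumptions: the `Workload` type and both function names are hypothetical, and treating the three conditions as alternatives with monthly volume as an additional gate is one reading of the rule, not something the report spells out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a task, used only for this sketch."""
    labeled_examples: int
    needs_style_consistency: bool
    prompt_tokens: int
    requests_per_month: int


def fine_tuning_reasons(w: Workload) -> list[str]:
    """Return which of the three conditions from the report are met."""
    reasons = []
    if w.labeled_examples > 10_000:
        reasons.append("more than 10k labeled examples")
    if w.needs_style_consistency:
        reasons.append("style/tone consistency required")
    if w.prompt_tokens > 2_000:
        reasons.append("prompt exceeds 2k tokens")
    return reasons


def worth_evaluating(w: Workload, crossover_requests: int = 1_000_000) -> bool:
    """Assumed reading: at least one condition AND volume at the crossover."""
    return bool(fine_tuning_reasons(w)) and w.requests_per_month >= crossover_requests
```

A workload with a 3k-token few-shot prompt but only 50k requests per month would match condition (3) and still fail the volume gate, which is the case the report argues should stay on few-shot prompting.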
Journey Context:
Fine-tuning incurs an upfront training cost ($30-100+ for GPT-4o-mini, higher for larger models) plus ongoing inference costs that are often higher than the base model's per-token rates (e.g., OpenAI has listed fine-tuned GPT-4o-mini at 2x the base per-token rate). The value proposition is twofold: fewer input tokens, because long prompts and system instructions are no longer needed, and better quality on narrow distributions. Common error: fine-tuning with <1k examples, which tends to overfit and generalize worse than few-shot prompting. Another error: fine-tuning for tasks where the base model already achieves >95% accuracy (a waste of money). Calculation: if fine-tuning removes 1k input tokens of few-shot examples per request, then at $0.15 per 1M input tokens (GPT-4o-mini's base input rate) you save $0.00015 per request. Amortizing a $100 training cost therefore takes $100 / $0.00015 ≈ 667k requests to break even, and that is before counting the fine-tuned model's higher per-token rates, which push the real break-even point higher. Thus, high volume is mandatory. Exception: fine-tuning for latency (shorter prompts mean faster TTFT) or for specific output formats that base models struggle with (rare edge cases).
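The break-even arithmetic above can be sketched as a small calculator. This is a rough illustration, not a pricing tool: the function name and parameters are hypothetical, the defaults reuse the figures from this report ($100 training cost, 1k tokens saved, $0.15 per 1M input tokens), and `extra_cost_per_request_usd` is an assumed knob for the fine-tuned model's higher per-token rates, which the simple estimate leaves out.

```python
import math


def break_even_requests(
    training_cost_usd: float = 100.0,
    tokens_saved_per_request: int = 1_000,
    input_price_per_million_usd: float = 0.15,
    extra_cost_per_request_usd: float = 0.0,
) -> float:
    """Requests needed before per-request savings repay the training cost.

    Returns math.inf when the assumed extra per-request cost of the
    fine-tuned model cancels the savings, i.e. it never pays for itself.
    """
    # Gross saving from the input tokens that no longer need to be sent.
    gross_saving = tokens_saved_per_request * input_price_per_million_usd / 1_000_000
    # Subtract whatever the fine-tuned model costs extra per request.
    net_saving = gross_saving - extra_cost_per_request_usd
    if net_saving <= 0:
        return math.inf
    return training_cost_usd / net_saving


if __name__ == "__main__":
    print(f"{break_even_requests():,.0f} requests")
    print(f"{break_even_requests(extra_cost_per_request_usd=0.00005):,.0f} requests")
```

With the defaults this gives roughly 667k requests. An assumed extra $0.00005 per request for fine-tuned inference raises it to about 1M, and a premium at or above the $0.00015 saving means the training cost is never recovered.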
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:51:19.710543+00:00— report_created — created