Report #84762
[cost_intel] Fine-tuning GPT-4o-mini never beats GPT-4o prompting on cost-quality for low-volume tasks
Fine-tune 4o-mini only when monthly inference exceeds roughly 5M output tokens on a narrow task (classification, extraction); below that, few-shot prompting with 4o is cheaper and higher quality.
Journey Context:
Teams fine-tune for 'brand voice' or classification at under 100k tokens/month of usage without weighing the fixed training cost ($30-60) against the per-token savings (4o-mini input is $0.15/1M vs 4o at $2.50/1M), which are tiny at that volume. The crossover is ~5M output tokens/month for classification tasks. More importantly, fine-tuned small models hallucinate on out-of-distribution inputs, where 4o with 5-shot prompting generalizes better. The failure signature is high accuracy on the training distribution but 40% accuracy on edge cases (e.g., classifying mixed-language inputs when the training data was English-only). Unless you have more than 10k labeled examples and high volume, prompting beats fine-tuning on both cost and quality.
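The fixed-versus-variable trade-off above can be sketched as a break-even calculation. This is a rough sketch, not a pricing tool: the training cost range and the input rates come from the report, while the output rates ($10.00 and $0.60 per 1M tokens) are assumptions that the report does not state. The function names are hypothetical. The sketch also ignores the extra prompt tokens that few-shot examples add to every 4o call, and any price difference between base and fine-tuned 4o-mini inference.

```python
def breakeven_tokens(training_cost_usd, large_rate_per_m, small_rate_per_m):
    """Tokens needed before per-token savings repay the one-time training cost."""
    savings_per_m = large_rate_per_m - small_rate_per_m
    if savings_per_m <= 0:
        raise ValueError("the small model must be cheaper per token")
    return training_cost_usd / savings_per_m * 1_000_000


def months_to_recover(training_cost_usd, monthly_tokens,
                      large_rate_per_m, small_rate_per_m):
    """Months of usage at a fixed monthly volume needed to repay training."""
    tokens = breakeven_tokens(training_cost_usd, large_rate_per_m, small_rate_per_m)
    return tokens / monthly_tokens


# From the report: training cost range and input rates (USD per 1M tokens).
TRAINING_COST_RANGE = (30.0, 60.0)
INPUT_RATE_4O, INPUT_RATE_MINI = 2.50, 0.15

# Assumption (not in the report): output rates in USD per 1M tokens.
ASSUMED_OUTPUT_RATE_4O, ASSUMED_OUTPUT_RATE_MINI = 10.00, 0.60

for cost in TRAINING_COST_RANGE:
    out_tokens = breakeven_tokens(cost, ASSUMED_OUTPUT_RATE_4O, ASSUMED_OUTPUT_RATE_MINI)
    months = months_to_recover(cost, 100_000, INPUT_RATE_4O, INPUT_RATE_MINI)
    print(f"training ${cost:.0f}: break-even at {out_tokens / 1e6:.1f}M output tokens; "
          f"{months:.0f} months to recover at 100k input tokens/month")
```

Under the assumed output rates, a training cost in the middle of the reported range ($45) breaks even at roughly 4.8M output tokens, which is in line with the ~5M crossover quoted above. At 100k tokens/month, the same training cost takes years to recover on input-rate savings alone.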
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:51:47.026451+00:00— report_created — created