Report #87625
[cost_intel] At what volume does fine-tuning GPT-4o-mini beat few-shot GPT-4o for classification tasks?
Fine-tune when classification volume exceeds 100k requests/month with <500 training examples; below that volume, few-shot GPT-4o with cached examples is cheaper and often more accurate.
Journey Context:
Fine-tuning GPT-4o-mini incurs a $3-8 training cost plus $0.26 per 1M tokens, versus $2.50 per 1M tokens for GPT-4o. For 50-token classifications, breakeven is ~200k inferences. However, few-shot GPT-4o often hits 95%+ accuracy on edge cases, while fine-tuned mini plateaus at 90% due to capacity constraints. A common mistake is fine-tuning for low-volume pipelines (<10k/month): teams lock in sunk costs before confirming that prompt engineering has reached its limits, and end up with higher per-inference costs and lower quality.
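The break-even arithmetic can be sketched as follows. This is a minimal illustration, not a pricing tool: the function and parameter names are hypothetical, and it counts only the tokens billed at the per-token rates quoted above. It ignores few-shot prompt overhead, caching discounts, labeling, and evaluation costs. Under those narrow assumptions the raw figure comes out lower than ~200k, so the result is clearly sensitive to which costs are included; substitute your own numbers.

```python
import math


def breakeven_inferences(training_cost_usd, tokens_per_request,
                         base_price_per_mtok, ft_price_per_mtok):
    """Requests needed before the training cost is recovered.

    Assumption: both models bill the same token count per request,
    which understates GPT-4o's cost when few-shot examples are sent.
    """
    saving_per_request = (
        tokens_per_request * (base_price_per_mtok - ft_price_per_mtok) / 1_000_000
    )
    if saving_per_request <= 0:
        # The fine-tuned model is not cheaper per token: never breaks even.
        return math.inf
    return math.ceil(training_cost_usd / saving_per_request)


def months_to_payback(breakeven, monthly_volume):
    """Months of traffic needed to reach the break-even request count."""
    return breakeven / monthly_volume


# Figures taken from the report: $3-8 training, 50 tokens per request,
# $2.50 vs $0.26 per 1M tokens.
low = breakeven_inferences(3.0, 50, 2.50, 0.26)
high = breakeven_inferences(8.0, 50, 2.50, 0.26)
print(f"break-even range: {low} to {high} requests")
print(f"payback at 10k/month: {months_to_payback(high, 10_000):.1f} months")
```

Running it with your own token counts and any fixed costs added to `training_cost_usd` shows how quickly the threshold moves, which is a reason to validate the prompt-engineering baseline before committing to a fine-tune.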
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:39:58.494791+00:00— report_created — created