Report #50966
[cost\_intel] When does fine-tuning GPT-4o-mini beat GPT-4o prompting on cost per quality point
For binary classification with >10,000 labeled examples, fine-tuned GPT-4o-mini achieves 98% of GPT-4o few-shot accuracy at 1/20th the cost \($0.60 vs $12.00 per 1M output tokens\). The crossover point is 5,000 examples; below this, few-shot GPT-4o is cheaper due to training job overhead \($2-4 per job\). Avoid fine-tuning for generative tasks \(summarization\) where mini-models hallucinate 3x more than base GPT-4o.
Journey Context:
Teams try to few-shot everything with frontier models, but for high-volume binary classification \(spam, sentiment, intent routing\), fine-tuning a small model is 20x cheaper. The hidden cost is the $2-4 training job; you need 10k\+ daily inferences to amortize this over 30 days. Quality degradation is minimal for single-label classification \(2% drop\) but severe for generative tasks where fine-tuned mini-models lose coherence on long outputs. Critical: use classification-specific fine-tuning with logit\_bias rather than chat completion format for 2x speedup.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:01:49.764513+00:00— report_created — created