Report #21700
[cost_intel] When does fine-tuning GPT-4o-mini beat few-shot prompting with Claude Haiku on cost per quality point?
Fine-tune GPT-4o-mini only when you have more than 10,000 labeled examples, the task is classification or extraction (not reasoning), and latency matters. At 10k+ examples, fine-tuned 4o-mini matches Claude 3.5 Haiku accuracy at roughly 1/10th the cost per request. The per-token prices are comparable ($0.30/1M vs $0.25/1M); the savings come from token count, because a fine-tuned model has the task baked in and needs far fewer prompt tokens than a few-shot prompt.
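To make the token-count argument concrete, here is a minimal sketch of the per-request arithmetic. The per-million prices are the ones quoted in this report; the prompt sizes (3,000 tokens for a few-shot prompt, 250 for a fine-tuned prompt) are hypothetical values chosen for illustration, and output tokens are ignored for simplicity.

```python
def cost_per_request(prompt_tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars for a single request."""
    return prompt_tokens * price_per_million / 1_000_000


# Prices per 1M tokens as quoted in this report.
HAIKU_PRICE = 0.25
FT_4O_MINI_PRICE = 0.30

# Hypothetical prompt sizes (assumptions, not measurements):
# a few-shot prompt carries instructions plus worked examples,
# a fine-tuned prompt carries only the input to classify.
FEW_SHOT_PROMPT_TOKENS = 3000
FINE_TUNED_PROMPT_TOKENS = 250

few_shot_cost = cost_per_request(FEW_SHOT_PROMPT_TOKENS, HAIKU_PRICE)
fine_tuned_cost = cost_per_request(FINE_TUNED_PROMPT_TOKENS, FT_4O_MINI_PRICE)

# Ratio of fine-tuned cost to few-shot cost per request.
cost_ratio = fine_tuned_cost / few_shot_cost

print(f"few-shot:   ${few_shot_cost:.6f} per request")
print(f"fine-tuned: ${fine_tuned_cost:.6f} per request")
print(f"ratio:      {cost_ratio:.2f}")
```

Under these assumed prompt sizes the slightly higher per-token price is outweighed by the 12x smaller prompt. With a shorter few-shot prompt the gap narrows, so measure your own prompts before relying on the ratio.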
Journey Context:
Teams often assume fine-tuning is always better for repetitive tasks. In practice, with fewer than 5k examples, fine-tuned models overfit and underperform few-shot prompting with a strong base model. The break-even is task-dependent: for sentiment analysis (simple labels), fine-tuning wins at 5k examples; for multi-label classification with 20+ categories, it needs 20k+ examples. Cost analysis must include both training cost ($0.80/1M tokens for 4o-mini) and inference cost. There is also a hidden cost: fine-tuned models require ongoing maintenance, including drift monitoring and retraining schedules. Use Haiku for dynamic schemas; fine-tune 4o-mini for fixed, high-volume tasks.
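The cost analysis described above can be sketched as a break-even calculation: how many requests it takes before per-request savings pay back the one-time training cost. The training price is the figure quoted in this report. The dataset size, tokens per example, epoch count, per-request costs, and the `maintenance_cost` parameter are all hypothetical inputs for illustration.

```python
import math
from typing import Optional


def break_even_requests(
    training_tokens: int,
    train_price_per_million: float,
    few_shot_cost: float,
    fine_tuned_cost: float,
    maintenance_cost: float = 0.0,
) -> Optional[int]:
    """Requests needed before fine-tuning pays for itself.

    Returns None when fine-tuning never pays back, i.e. when the
    fine-tuned per-request cost is not lower than the few-shot cost.
    """
    saving_per_request = few_shot_cost - fine_tuned_cost
    if saving_per_request <= 0:
        return None
    fixed_cost = (
        training_tokens * train_price_per_million / 1_000_000
        + maintenance_cost
    )
    return math.ceil(fixed_cost / saving_per_request)


# Hypothetical workload: 10,000 examples x 300 tokens x 3 epochs.
TRAINING_TOKENS = 10_000 * 300 * 3

requests_needed = break_even_requests(
    training_tokens=TRAINING_TOKENS,
    train_price_per_million=0.80,  # training price quoted in this report
    few_shot_cost=0.00075,         # assumed few-shot cost per request
    fine_tuned_cost=0.000075,      # assumed fine-tuned cost per request
)
print(f"break-even after {requests_needed} requests")
```

Passing a nonzero `maintenance_cost` (for example, an estimate of monitoring and retraining spend over the period) pushes the break-even point later, which is why the model only favors fine-tuning for fixed, high-volume tasks.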
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:49:55.337211+00:00— report_created — created