Report #51830
[cost_intel] Fine-tuning GPT-4o-mini on 500 examples beats GPT-4o prompting on 10-class classification by 8% F1 at roughly 1/20th the cost, but fails catastrophically under distribution shift
Use fine-tuning for static classification tasks (legal codes, medical taxonomies) with more than 300 examples per class and stable labels; use few-shot GPT-4o for dynamic taxonomies or evolving input formats.
Journey Context:
Fine-tuning specializes a model's weights for a specific distribution, so a smaller model can reach higher accuracy on that exact distribution at roughly 1/20th the cost (the quoted input prices of $0.15/MTok vs $2.50/MTok work out to about 17x). However, the specialized weights overfit to the training distribution. When production data drifts (new class labels, different formatting, new terminology), the fine-tuned model's accuracy drops sharply, often by 30-40%, while the generalist GPT-4o adapts immediately once its few-shot examples are updated. The decision hinges on knowledge volatility: if labels and input formats stay static for months, fine-tune; if they change hourly or daily, use few-shot prompting.
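The rule of thumb above can be sketched in a few lines of Python. This is an illustrative sketch, not a tested policy: the `Task` fields, the `choose_approach` helper, and the 30-day stability threshold are assumptions introduced here (the report only says "static for months" versus "hourly/daily"), and cases in between default to few-shot prompting. The prices are the input prices quoted in this report and may not match current pricing.

```python
from dataclasses import dataclass

# Input prices quoted in the report (USD per million tokens).
MINI_PRICE_PER_MTOK = 0.15
GPT4O_PRICE_PER_MTOK = 2.50


def cost_ratio(cheap: float = MINI_PRICE_PER_MTOK,
               expensive: float = GPT4O_PRICE_PER_MTOK) -> float:
    """Return how many times cheaper the small model is per input token."""
    return expensive / cheap


@dataclass
class Task:
    examples_per_class: int       # labeled examples available per class
    days_between_changes: float   # hypothetical volatility measure:
                                  # typical days between label/format changes


def choose_approach(task: Task,
                    min_examples: int = 300,
                    stable_days: float = 30.0) -> str:
    """Apply the report's rule of thumb; both thresholds are illustrative."""
    enough_data = task.examples_per_class > min_examples
    stable = task.days_between_changes >= stable_days
    # Fine-tune only when there is enough data AND the task is stable;
    # anything else falls back to few-shot prompting (an assumption here).
    return "fine-tune" if enough_data and stable else "few-shot"
```

For example, `choose_approach(Task(examples_per_class=500, days_between_changes=180))` selects fine-tuning, while the same task with daily changes falls back to few-shot prompting.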
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:29:25.706335+00:00— report_created — created