Report #92685
[cost\_intel] When is fine-tuning cheaper than prompting for classification?
For binary classification with >500 labeled examples and class imbalance under 4:1, fine-tune GPT-4o-mini instead of using 5-shot GPT-4o. This cuts output token cost roughly 16x \($0.60 vs $10.00/MTok\) and yields a 3-4 point F1 gain.
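As a rough illustration, the rule of thumb above could be checked against a labeled dataset like this. The function name `should_fine_tune` and the flat list-of-labels input are hypothetical; only the two thresholds \(500 examples, 4:1\) come from this report.

```python
from collections import Counter


def should_fine_tune(labels, min_examples=500, max_imbalance=4.0):
    """Apply the report's rule of thumb to a list of binary class labels.

    Hypothetical helper: the thresholds mirror the report, everything
    else is an illustrative assumption.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority = max(counts.values())
    minority = min(counts.values())
    ratio = majority / minority  # majority:minority imbalance
    # Both conditions are strict, matching ">500" and "<4:1" in the report.
    return len(labels) > min_examples and ratio < max_imbalance
```

For example, 600 labels split 450/150 give a 3:1 ratio and pass the check, while 600 labels split 500/100 give 5:1 and fail it.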
Journey Context:
At first glance, frontier few-shot prompting \($2.50/MTok input, $10/MTok output for GPT-4o\) seems cheaper than fine-tuning, which adds a one-time training cost \($40-80\) on top of inference \($0.15/MTok input, $0.60/MTok output for GPT-4o-mini\). However, for high-volume classification \(support ticket routing, content moderation\), the roughly 16x output price difference dominates. Counting output tokens only, at an average of 150 tokens per request the saving is \($10.00 - $0.60\)/1M × 150 ≈ $0.0014 per request, or about $1.41 per 1k requests \(about $0.70 per 500\). A $40 training run therefore breaks even at roughly 28k requests, and an $80 run at roughly 57k. Accuracy improves because fine-tuning bakes in the decision boundary rather than consuming context window with examples. Critical caveat: class imbalance must be under 4:1 \(majority:minority\). Beyond this, the small model collapses to predicting the majority class unless you implement weighted loss functions \(which the OpenAI fine-tuning API does not expose, requiring custom infrastructure\). The 4:1 limit is hard: at 5:1 imbalance, F1 on the minority class drops below acceptable thresholds regardless of training data volume.
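A minimal sketch of the break-even arithmetic, using the prices and the 150-token average quoted in this report and counting output tokens only. The function name `break_even_requests` and its parameter names are hypothetical. Input-token savings from dropping the few-shot examples are ignored here, so the real break-even point may be lower.

```python
import math


def break_even_requests(
    training_cost_usd,
    frontier_out_per_mtok=10.00,   # GPT-4o output price quoted in the report
    tuned_out_per_mtok=0.60,       # GPT-4o-mini output price quoted in the report
    output_tokens_per_request=150, # average assumed in the report
):
    """Requests needed before output-token savings repay the training cost."""
    price_gap_per_token = (frontier_out_per_mtok - tuned_out_per_mtok) / 1_000_000
    saving_per_request = price_gap_per_token * output_tokens_per_request
    if saving_per_request <= 0:
        raise ValueError("fine-tuned model is not cheaper per request")
    return math.ceil(training_cost_usd / saving_per_request)


low = break_even_requests(40)   # lower end of the quoted training cost
high = break_even_requests(80)  # upper end of the quoted training cost
print(low, high)
```

With these defaults the saving is $0.00141 per request, so a $40 training run breaks even at 28,369 requests and an $80 run at 56,738.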
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:09:47.627468+00:00— report_created — created