Report #78593
[cost\_intel] When does fine-tuning GPT-4o-mini beat few-shot GPT-4o for classification tasks?
With >500 labeled examples and a static schema, fine-tune GPT-4o-mini to come within 2-3% F1 of few-shot GPT-4o at 1/20th the cost \($0.30 vs $6.00 per 1M tokens\).
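As a rough illustration of the 1/20th figure, the sketch below applies the two per-1M-token prices quoted above to a monthly workload. The 50M-token volume and the `cost_usd` helper are assumptions made up for this example, not numbers from the report.

```python
def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a flat per-1M-token price."""
    return tokens / 1_000_000 * price_per_million


# Prices quoted in this report (USD per 1M tokens).
FINE_TUNED_MINI = 0.30
FEW_SHOT_GPT_4O = 6.00

# Hypothetical workload, assumed purely for illustration.
monthly_tokens = 50_000_000

mini = cost_usd(monthly_tokens, FINE_TUNED_MINI)
large = cost_usd(monthly_tokens, FEW_SHOT_GPT_4O)

print(f"fine-tuned GPT-4o-mini: ${mini:.2f}/month")
print(f"few-shot GPT-4o:        ${large:.2f}/month")
print(f"cost ratio:             1/{large / mini:.0f}")
```

A flat per-token comparison like this likely understates the gap, because a few-shot prompt resends its examples with every request while a fine-tuned model does not need them.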
Journey Context:
Teams over-rely on few-shot prompting with a large model for simple classification \(spam detection, sentiment, intent\), paying about $10/1M tokens for GPT-4o. For narrow domains with >500 labeled examples, a fine-tuned small model reaches F1 scores within 2-3% of the large model, and the cost drops to about $0.60/1M for GPT-4o-mini, or $0.10/1M for ada-002 embeddings feeding a lightweight classifier. The hidden cost is maintenance: a fine-tuned model must be retrained whenever the label schema changes, whereas a few-shot prompt adapts as soon as its examples are edited. Common pitfall: fine-tuning on <200 examples, which overfits and generalizes worse than few-shot. Rule of thumb: 500\+ examples for classification, 1000\+ for generation tasks. Validate on a holdout set; if the fine-tuned model doesn't beat few-shot by 5%\+, the maintenance overhead isn't worth it.
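The holdout check can be sketched as follows: score both approaches on the same held-out labels and compare binary F1. This is a minimal sketch under stated assumptions. The function names, the toy labels, and the reading of "5%+" as 0.05 absolute F1 are all illustrative choices; in practice the predictions would come from the two models being compared.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def worth_fine_tuning(f1_fine_tuned: float, f1_few_shot: float,
                      min_gain: float = 0.05) -> bool:
    """Assumed decision rule: require an absolute F1 gain of at least min_gain."""
    return f1_fine_tuned - f1_few_shot >= min_gain


# Toy holdout labels and predictions (hypothetical, for illustration only).
holdout_labels   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
few_shot_preds   = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
fine_tuned_preds = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

f1_fs = f1_score(holdout_labels, few_shot_preds)
f1_ft = f1_score(holdout_labels, fine_tuned_preds)
print(f"few-shot F1:   {f1_fs:.3f}")
print(f"fine-tuned F1: {f1_ft:.3f}")
print("fine-tune" if worth_fine_tuning(f1_ft, f1_fs) else "stay with few-shot")
```

The holdout examples should be excluded from both the fine-tuning set and the few-shot prompt, otherwise either score may be inflated.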
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:31:00.825627+00:00— report_created — created