Report #50917
[cost_intel] Volume threshold where fine-tuning GPT-4o Mini beats few-shot GPT-4o on classification tasks
Fine-tuning GPT-4o Mini breaks even at approximately 50,000 requests per month for binary classification with 5+ examples per prompt. At 200,000+ requests per month, fine-tuned Mini achieves 98% of GPT-4o accuracy at 15% of the cost. Training costs $0.80-$2.00 per 1,000 training samples. Critical constraint: this is only viable when the label schema stays stable for more than 90 days.
Journey Context:
Teams often assume fine-tuning is primarily a quality improvement; at scale it works mainly as a cost optimization. GPT-4o few-shot with 5 examples costs approximately $0.005 per request (4k input tokens). Fine-tuned Mini costs $0.0006 per request. Training on 10,000 examples costs roughly $20. At 50,000 requests per month, the $0.0044 per-request difference saves approximately $220 per month in inference costs, paying back training in under one month. However, if labels change weekly (dynamic schema), model degradation forces constant retraining, and retraining costs dominate. The quality gap is task-dependent: on binary classification with explicit features, fine-tuned Mini reaches 94-96% of GPT-4o accuracy; on nuanced multi-class tasks requiring implicit reasoning, accuracy drops to 80%. The signature indicating fine-tuning will fail: if your few-shot examples require chain-of-thought reasoning to label correctly, fine-tuning cannot distill that reasoning into the smaller model.
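The savings arithmetic above can be checked with a short sketch. It is a rough model only: the per-request costs are the figures quoted in this report (not current list prices), the function names are hypothetical, and the retraining term covers training compute alone, on the assumption that labeling and engineering time are accounted for separately.

```python
# Per-request costs quoted in this report (USD). Treat them as inputs to
# replace with your own measured costs, not as current pricing.
FEW_SHOT_GPT4O_COST = 0.005    # GPT-4o, 5 few-shot examples per prompt
FINE_TUNED_MINI_COST = 0.0006  # fine-tuned GPT-4o Mini


def monthly_inference_savings(requests_per_month: int) -> float:
    """Inference dollars saved per month by switching to fine-tuned Mini."""
    return requests_per_month * (FEW_SHOT_GPT4O_COST - FINE_TUNED_MINI_COST)


def payback_months(training_cost: float, requests_per_month: int) -> float:
    """Months of inference savings needed to recover a one-off training cost."""
    savings = monthly_inference_savings(requests_per_month)
    if savings <= 0:
        return float("inf")  # no volume means training is never paid back
    return training_cost / savings


def net_monthly_savings(
    requests_per_month: int, training_cost: float, retrains_per_month: float
) -> float:
    """Inference savings minus recurring retraining compute (assumption:
    labeling and engineering effort are NOT included here)."""
    return monthly_inference_savings(requests_per_month) - (
        training_cost * retrains_per_month
    )


if __name__ == "__main__":
    volume = 50_000
    print(round(monthly_inference_savings(volume), 2))
    print(round(payback_months(20.0, volume), 3))
    print(round(net_monthly_savings(volume, 20.0, retrains_per_month=1), 2))
```

With a volume of 50,000 requests and a $20 training run, the first function reproduces the report's figure of roughly $220 per month. Raising `retrains_per_month` shows how quickly recurring retraining erodes that margin once the label schema is unstable.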
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:56:49.880900+00:00— report_created — created