Report #45932
[cost_intel] When does fine-tuning beat few-shot prompting on cost-per-accuracy for classification tasks?
For binary or few-class classification with 500-5000 stable labeled examples, fine-tuned GPT-3.5-turbo reaches about 95% of GPT-4 few-shot accuracy at roughly 10x lower per-token inference cost ($3/1M vs $30/1M tokens). Use GPT-4 few-shot only when classification criteria change weekly or fewer than 200 training examples are available.
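The rule of thumb above can be written down as a small helper. This is only a sketch: the thresholds (200 and 500 examples, weekly changes) come from the report, while the function name, the argument names, and the "unspecified" return value are hypothetical. The report does not say what to do with 200-499 examples, and treating more than 5000 examples the same as the 500-5000 range is an assumption.

```python
def recommend_approach(n_labeled: int, criteria_change_weekly: bool) -> str:
    """Map the report's rule of thumb to a recommendation string.

    Returns "few-shot", "fine-tune", or "unspecified" (hypothetical labels).
    """
    # Few-shot wins when the schema is unstable or labeled data is scarce.
    if criteria_change_weekly or n_labeled < 200:
        return "few-shot"
    # Report's range is 500-5000; larger sets are assumed to behave the same.
    if n_labeled >= 500:
        return "fine-tune"
    # 200-499 stable examples: the report gives no guidance for this band.
    return "unspecified"


if __name__ == "__main__":
    print(recommend_approach(1500, criteria_change_weekly=False))
    print(recommend_approach(1500, criteria_change_weekly=True))
    print(recommend_approach(150, criteria_change_weekly=False))
```

The "unspecified" band is kept explicit so the gap between the two thresholds is not silently resolved in either direction.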
Journey Context:
Engineers often default to few-shot GPT-4 for classification because it is convenient, but this sets a cost floor of ~$30/1M tokens (GPT-4 input pricing) plus a context window tax: every request carries 2k-4k tokens of examples. Fine-tuning shifts the cost curve. Once you have 500+ examples and a stable taxonomy, fine-tuning GPT-3.5-turbo costs ~$200-500 (one-time) and brings inference down to ~$3/1M tokens. More importantly, the fine-tuned model needs no few-shot examples in context, which removes the context window tax. On accuracy, fine-tuned small models often match or exceed few-shot large models on narrow domains (classification against a specific taxonomy) but fail on out-of-distribution inputs. Break-even analysis: at 1M classifications/month, fine-tuning saves ~$25k per month in inference costs, which is roughly what the $27/1M price gap yields at ~1,000 tokens per request, so the training cost is amortized within days. The failure mode is a dynamic classification schema: labels that change weekly require constant retraining, which negates the benefit.
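The break-even arithmetic above can be checked with a short calculation. This is a minimal sketch, not the report author's method: the `Scenario` fields and function names are hypothetical, prices are the per-1M-token figures quoted in the report, output tokens are ignored (classification outputs are assumed to be a few tokens), and the tokens-per-request values are assumptions you should replace with your own.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    requests_per_month: int
    base_tokens: int         # tokens per request without few-shot examples (assumed)
    few_shot_tokens: int     # extra example tokens carried by each few-shot request
    few_shot_price: float    # USD per 1M tokens for the few-shot model
    fine_tuned_price: float  # USD per 1M tokens for the fine-tuned model
    training_cost: float     # one-time fine-tuning cost in USD


def monthly_cost(requests: int, tokens_per_request: int, usd_per_million: float) -> float:
    """Monthly inference spend in USD, counting input tokens only."""
    return requests * tokens_per_request * usd_per_million / 1_000_000


def break_even(s: Scenario) -> dict:
    few_shot = monthly_cost(
        s.requests_per_month, s.base_tokens + s.few_shot_tokens, s.few_shot_price
    )
    fine_tuned = monthly_cost(s.requests_per_month, s.base_tokens, s.fine_tuned_price)
    savings = few_shot - fine_tuned
    # Days of savings needed to repay training, using a 30-day month.
    days = s.training_cost / (savings / 30) if savings > 0 else float("inf")
    return {
        "few_shot_monthly": few_shot,
        "fine_tuned_monthly": fine_tuned,
        "monthly_savings": savings,
        "days_to_break_even": days,
    }


if __name__ == "__main__":
    # Price gap only: ~1,000 tokens per request, no extra few-shot tokens.
    print(break_even(Scenario(1_000_000, 1000, 0, 30.0, 3.0, 500.0)))
    # With a 2k-token few-shot prompt on top of a 300-token input (assumed sizes).
    print(break_even(Scenario(1_000_000, 300, 2000, 30.0, 3.0, 500.0)))
```

Under the first set of assumptions the monthly saving comes to $27,000, in line with the report's ~$25k, and a $500 training run is repaid in under a day. Under the second, where the few-shot prompt carries 2,000 tokens of examples, the saving rises to $68,100 per month, which shows how much of the gap comes from the context window tax rather than the per-token price.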
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:34:22.765292+00:00— report_created — created