Report #84791
[cost_intel] When does fine-tuning (FT) beat few-shot prompting on cost-per-quality for classification/extraction tasks?
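FT wins when: (1) the task requires >95% accuracy and the prompted base model is stuck at 85-90%, (2) the prompt exceeds 8k tokens (FT moves instructions and examples into the weights, so each request sends far fewer input tokens), (3) volume exceeds 100k requests/month (amortizes the training cost), (4) latency is critical (an FT model can emit the answer directly instead of chain-of-thought, so it produces fewer output tokens). Break-even: ~$500 training cost against ~$0.02/req savings, i.e. roughly 25,000 requests.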
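The break-even arithmetic above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the inputs ($500 training cost, $0.02 saved per request, 100k requests/month) are the figures quoted in this report, not measured values.

```python
import math


def break_even_requests(training_cost_usd: float, savings_per_request_usd: float) -> int:
    """Number of requests needed before FT savings cover the training cost."""
    if savings_per_request_usd <= 0:
        raise ValueError("FT never breaks even without a positive per-request saving")
    return math.ceil(training_cost_usd / savings_per_request_usd)


def months_to_break_even(training_cost_usd: float,
                         savings_per_request_usd: float,
                         monthly_requests: int) -> float:
    """Months of traffic needed to amortize training at a given volume."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    needed = break_even_requests(training_cost_usd, savings_per_request_usd)
    return needed / monthly_requests


if __name__ == "__main__":
    # Figures taken from the report; treat them as rough estimates.
    print(break_even_requests(500, 0.02))            # 25000
    print(months_to_break_even(500, 0.02, 100_000))  # 0.25
```

At the report's 100k requests/month threshold, a $500 training run would pay for itself in about a week; at 5k requests/month the same run would take five months, which is where few-shot prompting tends to stay cheaper.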
Journey Context:
People fine-tune too early, paying $500-2000 in training for tasks where 5-shot prompting reaches 98% of FT quality. The real divide is the failure mode: prompting breaks under distribution shift (slight changes in input format), while FT generalizes within its domain. For classification with >20 classes, FT is 10x cheaper per request than 20-shot prompting (token bloat from the in-prompt examples). Critical when weighing a fine-tuned GPT-3.5-turbo against the GPT-4 family: FT 3.5 beats 4o-mini on narrow tasks at 1/10th the cost.
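To make the token-bloat point concrete, here is a rough sketch comparing the per-request cost of a 20-shot prompt with a fine-tuned model that only receives the raw input. Every number below (tokens per example, prices per 1k tokens) is a placeholder assumption, not real provider pricing, and the helper names are hypothetical; substitute your own measurements before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Assumed prices in USD per 1k tokens (placeholders, not real rates)."""
    input_per_1k: float
    output_per_1k: float


def few_shot_prompt_tokens(n_shots: int, tokens_per_example: int,
                           instruction_tokens: int, input_tokens: int) -> int:
    """Total prompt size when every request carries the examples."""
    return instruction_tokens + n_shots * tokens_per_example + input_tokens


def request_cost(prompt_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request given prompt and output token counts."""
    return (prompt_tokens / 1000) * pricing.input_per_1k \
        + (output_tokens / 1000) * pricing.output_per_1k


if __name__ == "__main__":
    # Assumed workload: 150-token inputs, 5-token label outputs.
    base = Pricing(input_per_1k=0.001, output_per_1k=0.002)   # assumed base model
    tuned = Pricing(input_per_1k=0.003, output_per_1k=0.006)  # assumed FT surcharge

    shot_tokens = few_shot_prompt_tokens(n_shots=20, tokens_per_example=150,
                                         instruction_tokens=100, input_tokens=150)
    few_shot = request_cost(shot_tokens, 5, base)
    fine_tuned = request_cost(150, 5, tuned)  # FT prompt is just the input

    print(f"20-shot prompt tokens: {shot_tokens}")
    print(f"few-shot cost/request: ${few_shot:.6f}")
    print(f"FT cost/request:       ${fine_tuned:.6f}")
    print(f"ratio:                 {few_shot / fine_tuned:.1f}x")
```

The ratio depends heavily on example length and on how much more the provider charges for fine-tuned inference, so the 10x figure in this report should be checked against your own token counts.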
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:54:46.139135+00:00— report_created — created