Report #78174
[cost_intel] At what volume does fine-tuning GPT-3.5-Turbo beat few-shot prompting GPT-4o on cost per quality point?
Fine-tune when daily volume exceeds 10k requests AND the task domain is stable (input distribution changes <5% monthly). On narrow tasks (classification, extraction), fine-tuned GPT-3.5-Turbo delivers accuracy comparable to few-shot GPT-4o at about 1/10th the inference cost, but it requires a $200-500 upfront training cost.
Journey Context:
Teams often assume frontier models are cheaper because they avoid training costs, but at high volume the per-token savings of a smaller fine-tuned model dominate. The crossover point depends on task specificity. For a binary sentiment classifier, fine-tuned GPT-3.5-Turbo achieves 94% accuracy vs 96% for GPT-4o, but costs $0.0015 vs $0.015 per 1k tokens. At 50k requests/day (avg 500 tokens each), that is 25M tokens/day, or $37.50/day vs $375/day, a saving of $337.50/day. The $400 training cost pays back in ~1.2 days. However, if the domain drifts (e.g., new product categories are added), the fine-tuned model degrades, while GPT-4o can be adapted through prompt changes. Common mistake: fine-tuning for tasks that require broad world knowledge or reasoning. Fine-tuning teaches style and format, not facts.
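The payback arithmetic above can be reproduced with a small script. This is a minimal sketch, not a pricing tool: the function names `daily_cost` and `payback_days` are hypothetical helpers introduced here, and the prices, volume, and training cost are the figures quoted in this report, not values checked against a current price list.

```python
def daily_cost(requests_per_day: int, tokens_per_request: int,
               price_per_1k_tokens: float) -> float:
    """Daily inference spend in dollars for a given volume and per-1k-token price."""
    tokens_per_day = requests_per_day * tokens_per_request
    return tokens_per_day / 1000 * price_per_1k_tokens


def payback_days(training_cost: float, fine_tuned_daily: float,
                 frontier_daily: float) -> float:
    """Days until the one-off training cost is recovered by daily savings.

    Returns infinity when the fine-tuned model is not cheaper per day.
    """
    daily_saving = frontier_daily - fine_tuned_daily
    if daily_saving <= 0:
        return float("inf")
    return training_cost / daily_saving


# Figures taken from the report (assumed, not verified against current pricing).
REQUESTS_PER_DAY = 50_000
TOKENS_PER_REQUEST = 500
FINE_TUNED_PRICE = 0.0015   # $ per 1k tokens, fine-tuned GPT-3.5-Turbo
FRONTIER_PRICE = 0.015      # $ per 1k tokens, GPT-4o few-shot
TRAINING_COST = 400.0       # $ one-off, within the $200-500 range quoted

ft_daily = daily_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, FINE_TUNED_PRICE)
frontier_daily = daily_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, FRONTIER_PRICE)
days = payback_days(TRAINING_COST, ft_daily, frontier_daily)

print(f"fine-tuned: ${ft_daily:.2f}/day, frontier: ${frontier_daily:.2f}/day")
print(f"payback: {days:.1f} days")
```

Swapping in 10k requests/day (the threshold in the summary) with the same prices stretches the payback to roughly six days, which is one way to sanity-check whether a given volume justifies the training spend. The sketch ignores retraining costs after domain drift and the 2-point accuracy gap, both of which would need to be weighed separately.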
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:48:49.883596+00:00— report_created — created