Report #40329
[cost_intel] At what volume does fine-tuning GPT-3.5 beat GPT-4 prompting for classification tasks?
Fine-tune GPT-3.5-Turbo when processing more than 50k classifications/day with fewer than 10 distinct labels and a stable input distribution. Expect roughly 10x lower cost and 3x lower latency than GPT-4 few-shot prompting, at a 2-5% accuracy cost that is acceptable for high-volume routing.
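The rule of thumb above can be written as a simple check. This is a minimal sketch: the `Workload` type and `should_fine_tune` function are hypothetical names introduced for illustration, and the thresholds are the ones quoted in this report, not universal constants.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a classification workload."""
    daily_volume: int          # classifications per day
    num_labels: int            # number of distinct labels
    stable_distribution: bool  # True if inputs do not drift much over time


def should_fine_tune(
    workload: Workload,
    min_daily_volume: int = 50_000,  # threshold from this report
    max_labels: int = 10,            # threshold from this report
) -> bool:
    """Return True when all three conditions from the report hold."""
    return (
        workload.daily_volume > min_daily_volume
        and workload.num_labels < max_labels
        and workload.stable_distribution
    )


# Example: a support-ticket router with 1M tickets/day and 8 labels.
ticket_router = Workload(daily_volume=1_000_000, num_labels=8, stable_distribution=True)
print(should_fine_tune(ticket_router))
```

Treat the result as a first filter only; whether the 2-5% accuracy trade-off is tolerable still depends on the task.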
Journey Context:
Engineers often assume that GPT-4 prompting is both 'smarter' and cheaper than fine-tuning because of the upfront training cost (2-8M training tokens at $8/M, i.e. roughly $16-64). A break-even calculation that looks only at that training cost ignores latency (GPT-4 is 2-3x slower) and rate-limit constraints. Fine-tuning excels on narrow distributions (support tickets, intent classification) but fails to generalize zero-shot to out-of-distribution inputs. Critical error: fine-tuning on dirty data amplifies false confidence; always reserve 20% of the labeled data for validation. For high-volume routing (e.g., 1M support tickets per day), fine-tuned GPT-3.5 costs about $200/day versus about $2000/day for GPT-4. The 2-5% accuracy drop is acceptable for triage, but not for final diagnosis.
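The break-even arithmetic can be sketched as follows. The function names are hypothetical, and the only inputs are the figures quoted in this report ($8 per million training tokens, $200/day versus $2000/day at 1M classifications/day). The sketch assumes inference cost scales linearly with volume and leaves out labeling, evaluation, and retraining costs, so read the result as a lower bound on the real break-even point.

```python
import math


def training_cost_usd(training_tokens_millions: float,
                      price_per_million_usd: float = 8.0) -> float:
    """One-off fine-tuning cost, using the $8/M figure from this report."""
    return training_tokens_millions * price_per_million_usd


def per_item_cost_usd(daily_cost_usd: float, daily_volume: int) -> float:
    """Average inference cost per classification."""
    return daily_cost_usd / daily_volume


def break_even_items(train_cost_usd: float,
                     gpt4_item_cost_usd: float,
                     finetuned_item_cost_usd: float) -> float:
    """Number of classifications after which the training cost is recovered.

    Returns infinity if the fine-tuned model is not cheaper per item.
    """
    savings_per_item = gpt4_item_cost_usd - finetuned_item_cost_usd
    if savings_per_item <= 0:
        return math.inf
    return train_cost_usd / savings_per_item


# Figures quoted in this report: 1M tickets/day, $2000/day vs $200/day.
gpt4_item = per_item_cost_usd(2000.0, 1_000_000)
finetuned_item = per_item_cost_usd(200.0, 1_000_000)

# Upper end of the quoted training range: 8M tokens.
worst_case_training = training_cost_usd(8.0)

items = break_even_items(worst_case_training, gpt4_item, finetuned_item)
print(f"Training cost: ${worst_case_training:.2f}")
print(f"Break-even after about {items:,.0f} classifications")
```

Under these assumptions the training cost is small next to the daily inference savings, so the deciding factors in practice are usually the accuracy trade-off and the stability of the input distribution, not the training bill.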
Lifecycle
2026-06-18T22:09:53.222722+00:00 - report_created - created