Report #56235
[cost\_intel] Fine-tuning GPT-3.5 Turbo underperforms prompting GPT-4 for classification tasks
For high-volume classification \(>10k daily inferences\) with a stable label schema, fine-tune GPT-3.5 Turbo or use GPT-4o mini; for dynamic schemas or few-shot scenarios, use GPT-4o. Fine-tuned GPT-3.5 Turbo costs $3.00/$6.00 per 1M tokens \(input/output\) versus $2.50/$10.00 for GPT-4o: per-token pricing is comparable, but the fine-tuned model responds 5-6x faster. On narrow classification \(5-10 labels, 5k\+ training examples\), fine-tuned GPT-3.5 Turbo reaches 96% accuracy versus 97% for GPT-4o, with far lower latency \(reported as 90% lower, although a 5-6x speedup corresponds to roughly 80-83% lower\) and no rate limit issues.
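The per-token prices above translate into per-request cost only once token counts are known. The following is a minimal sketch of that arithmetic, using the prices quoted in this report; the token counts \(150 input tokens for the fine-tuned model, 1500 for a few-shot GPT-4o prompt, 5 output tokens each\) and the names `PRICES` and `cost_per_request` are hypothetical and should be replaced with measured values.

```python
# Prices in USD per 1M tokens, as quoted in this report (verify against current pricing).
PRICES = {
    "ft-gpt-3.5-turbo": {"input": 3.00, "output": 6.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-1M-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical token counts: the fine-tuned model sees only the text to classify,
# while the GPT-4o prompt also carries few-shot examples on every call.
ft_cost = cost_per_request("ft-gpt-3.5-turbo", input_tokens=150, output_tokens=5)
gpt4o_cost = cost_per_request("gpt-4o", input_tokens=1500, output_tokens=5)

print(f"fine-tuned GPT-3.5 Turbo: ${ft_cost:.5f} per request")
print(f"few-shot GPT-4o:          ${gpt4o_cost:.5f} per request")
```

Under these assumed token counts the saving comes almost entirely from the shorter prompt, not from the per-token price, which is slightly higher on input for the fine-tuned model.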
Journey Context:
Teams often assume GPT-4 is the 'safer' choice for classification despite its higher cost. However, fine-tuning on a domain-specific classification task \(sentiment, routing, tagging\) compresses the task into the smaller model's weights, which eliminates the few-shot examples in the prompt \(these consume tokens and count against rate limits\). Break-even: roughly 20k inferences amortize the training cost \($200-800\). Failure mode: if labels change frequently or training data is sparse \(<1k examples\), the fine-tuned model hallucinates labels; use GPT-4 with RAG instead.
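The break-even point depends on how much each request saves, which this report does not state. The sketch below shows the calculation under stated assumptions: the training costs are the $200-800 range from this report, while the per-request costs \($0.0038 for few-shot GPT-4o, $0.00048 for the fine-tuned model\) and the function name `break_even_inferences` are hypothetical.

```python
import math


def break_even_inferences(training_cost: float,
                          baseline_cost: float,
                          finetuned_cost: float) -> int:
    """Number of requests after which per-request savings cover the training cost."""
    saving = baseline_cost - finetuned_cost
    if saving <= 0:
        raise ValueError("fine-tuned model is not cheaper per request; no break-even")
    return math.ceil(training_cost / saving)


# Hypothetical per-request costs in USD; replace with measured values.
BASELINE_COST = 0.0038    # few-shot GPT-4o prompt
FINETUNED_COST = 0.00048  # fine-tuned GPT-3.5 Turbo, no few-shot examples

low = break_even_inferences(200, BASELINE_COST, FINETUNED_COST)
high = break_even_inferences(800, BASELINE_COST, FINETUNED_COST)
print(f"break-even between {low} and {high} inferences")

# Per-request saving implied by a 20k break-even at each end of the training-cost range.
implied_low = 200 / 20_000
implied_high = 800 / 20_000
print(f"a 20k break-even implies saving ${implied_low:.2f} to ${implied_high:.2f} per request")
```

With these assumed costs the break-even lands at roughly 60k to 241k inferences. The ~20k figure corresponds to a per-request saving of about $0.01 to $0.04, which would require longer few-shot prompts than assumed here, so it is worth recomputing with real token counts.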
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:53:09.323903+00:00— report_created — created