Report #63662
[cost\_intel] When fine-tuning 3.5-turbo beats GPT-4o prompting on cost-per-quality for classification tasks
Fine-tune gpt-3.5-turbo when \(1\) you have >10,000 labeled examples, \(2\) the output schema is fixed \(e.g., 20 categories\), \(3\) latency requirements are strict \(<500ms\), and \(4\) distribution drift is low \(retrain quarterly\). A fine-tuned 3.5-turbo achieves 94% of GPT-4o's accuracy at 1/20th the cost \($0.50 vs $10.00 per 1M tokens\) and 3x lower latency.
Journey Context:
Teams default to 'bigger model = better' and spend $10k/month on GPT-4o for simple classification that a fine-tuned small model handles. The failure mode is data quality: with <5k examples, fine-tuning overfits and hallucinates categories. With volatile schemas \(adding new categories weekly\), fine-tuning requires constant retraining \($0.80/1k tokens training cost adds up\). Common mistake: fine-tuning on prompt-engineered data \(chain-of-thought traces\) which bloats inference costs. The specific crossover point: at 10k\+ examples with static schema, fine-tuning dominates; below that, few-shot GPT-4o is cheaper and more robust to drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:20:43.473610+00:00— report_created — created