Report #78146
[cost_intel] When does fine-tuning a small model beat GPT-4 prompting on cost per quality for classification tasks?
Fine-tune GPT-3.5-turbo or Llama-3-8B when your classification task has more than 10k labeled examples, less than 5% monthly distribution drift, and a low-latency requirement. This costs roughly 20x less than GPT-4 at comparable accuracy. Monitor for drift, though: a 10% distribution shift degrades fine-tuned accuracy by about 15%, while foundation models degrade gracefully.
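To make the cost side of this comparison concrete, here is a minimal break-even sketch. The function names, the cost model (a one-time fine-tuning cost, a fixed monthly hosting cost, and a per-request price for each option), and all inputs are assumptions for illustration. None of them are figures from this report, so substitute your own measured prices and volumes.

```python
import math
from typing import Optional


def break_even_requests(one_time_cost: float,
                        large_cost_per_request: float,
                        small_cost_per_request: float) -> Optional[int]:
    """Requests needed before per-request savings repay the one-time cost.

    Returns None when the small model is not cheaper per request.
    """
    saving = large_cost_per_request - small_cost_per_request
    if saving <= 0:
        return None
    return math.ceil(one_time_cost / saving)


def months_to_break_even(one_time_cost: float,
                         monthly_fixed_cost: float,
                         requests_per_month: float,
                         large_cost_per_request: float,
                         small_cost_per_request: float) -> Optional[float]:
    """Months until cumulative savings cover the one-time cost.

    monthly_fixed_cost is a hypothetical hosting/maintenance charge for the
    fine-tuned model. Returns None if the monthly saving is not positive.
    """
    per_request_saving = large_cost_per_request - small_cost_per_request
    monthly_saving = requests_per_month * per_request_saving - monthly_fixed_cost
    if monthly_saving <= 0:
        return None
    return one_time_cost / monthly_saving
```

If `months_to_break_even` returns a value longer than the period over which you expect the data to stay stable, the fine-tuned model may not pay for itself before it needs retraining.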
Journey Context:
Teams often default to GPT-4 for all classification because they fear losing accuracy. For stable domains (e.g., classifying support tickets for a mature product), this is 20-50x more expensive than necessary. Fine-tuning a 7B-8B parameter model on 10k-50k examples captures domain-specific patterns (error codes, internal jargon) better than few-shot GPT-4, which reasons generally. The cliff is distribution shift: if your product launches a new feature and ticket vocabulary changes, the fine-tuned model hallucinates categories, while GPT-4 adapts via a prompt update. The break-even point is 6-12 months of stable data. Measure drift as the embedding distance between new inputs and the training set; if cosine similarity drops by more than 0.1, fall back to the foundation model.
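The drift check described above could be sketched as follows. This is one possible reading, not a confirmed method: it assumes you already have embedding vectors from some embedding model, compares each new input to the centroid of the training embeddings, and treats the "drop" as the difference between a baseline similarity (measured on held-out in-distribution data) and the current batch. The function names and the centroid approach are illustrative assumptions. Only the 0.1 threshold comes from the report.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(train_centroid: Vector, batch: Sequence[Vector]) -> float:
    """Average cosine similarity of a batch to the training centroid."""
    if not batch:
        raise ValueError("need at least one vector in the batch")
    return sum(cosine_similarity(train_centroid, v) for v in batch) / len(batch)


def should_fall_back(baseline_similarity: float,
                     current_similarity: float,
                     max_drop: float = 0.1) -> bool:
    """True when similarity has dropped by more than max_drop from baseline."""
    return (baseline_similarity - current_similarity) > max_drop
```

In use, you would compute the centroid once from training embeddings, record `mean_similarity` on a held-out batch as the baseline, then recompute it on each window of production inputs and route traffic to the foundation model whenever `should_fall_back` returns true. A centroid comparison is coarse; it can miss drift that changes the shape of the distribution without moving its mean.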
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:45:51.206434+00:00— report_created — created