Report #64464
[cost_intel] At what classification volume does fine-tuning GPT-3.5-turbo become cheaper per-query than few-shot prompting with GPT-4, given 50+ classes?
Fine-tune GPT-3.5-turbo (or migrate to fine-tuned Llama-3-8B on Groq/Together) when you have >10k labeled examples, >50 classes, and query volume >100k requests/month. Use GPT-4 only for the initial labeling/verification phase, not production serving.
Journey Context:
Few-shot prompting with GPT-4 for high-cardinality classification needs 3-5 examples per class in the context window to handle long-tail classes effectively. For 50 classes, that is 150-250 examples (5-10k tokens) in every prompt. At $0.03 per 1k input tokens, a 10k-token prompt costs $0.30 per query in context overhead alone, before output tokens. A fine-tuned GPT-3.5-turbo model costs $0.003 per 1k input tokens and needs no examples in the prompt, only the text to classify. Break-even is roughly at 100k queries/month when amortizing the $200-500 training cost. The failure mode of prompting is 'class collapse': rare classes get misclassified as common ones under context window pressure, because examples for all 50 classes cannot be fit in effectively. Fine-tuning bakes the class distribution into the weights instead. Common mistake: continuing to use GPT-4 'for accuracy' in production at $0.15-0.30/query (5-10k prompt tokens at $0.03 per 1k) when the task is really a simple feature extraction that a fine-tuned cheap model handles at about $0.0003/query (roughly 100 input tokens at $0.003 per 1k).
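The arithmetic above can be checked with a small sketch. The per-1k prices and the $200-500 training cost come from the report. The 100-token query length and the function names are assumptions made for illustration, and output tokens are ignored. Taken at face value, these input-token figures recover the training cost within a few thousand queries. The 100k queries/month threshold may also cover costs not itemized here (labeling, evaluation, retraining), but that is an assumption, not something the report states.

```python
import math

# Prices quoted in the report (USD per 1k input tokens).
GPT4_INPUT_PER_1K = 0.03
FINE_TUNED_INPUT_PER_1K = 0.003

# Assumption: the text to classify is about 100 tokens.
QUERY_TOKENS = 100


def per_query_cost(prompt_tokens, price_per_1k):
    """Input-token cost of one request in USD (output tokens ignored)."""
    return prompt_tokens / 1000 * price_per_1k


def break_even_queries(training_cost, few_shot_cost, fine_tuned_cost):
    """Number of queries after which the per-query saving covers training."""
    saving = few_shot_cost - fine_tuned_cost
    if saving <= 0:
        raise ValueError("fine-tuned serving is not cheaper per query")
    return math.ceil(training_cost / saving)


# Few-shot prompt: 10k tokens of examples plus the query itself.
few_shot = per_query_cost(10_000 + QUERY_TOKENS, GPT4_INPUT_PER_1K)
# Fine-tuned prompt: only the query.
fine_tuned = per_query_cost(QUERY_TOKENS, FINE_TUNED_INPUT_PER_1K)

for training_cost in (200, 500):
    n = break_even_queries(training_cost, few_shot, fine_tuned)
    print(f"training ${training_cost}: recovered after {n} queries")

monthly = 100_000
print(f"few-shot GPT-4 at {monthly} queries/month: ${few_shot * monthly:,.0f}")
print(f"fine-tuned at {monthly} queries/month: ${fine_tuned * monthly:,.0f}")
```

Swapping in your own prompt length, query length, and current provider prices is the main use of this sketch, since the break-even point is very sensitive to how many example tokens the few-shot prompt carries.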
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:41:13.227748+00:00— report_created — created