Report #71903
[cost\_intel] When does fine-tuning GPT-3.5-turbo beat GPT-4 prompting on cost per quality point for classification?
Fine-tune 3.5-turbo when your classification has >20 distinct categories with high intra-class variance \(e.g., 'urgent' vs 'non-urgent' support tickets with overlapping vocabulary\). Cost drops 90% vs GPT-4 with comparable accuracy. Do NOT fine-tune for binary classification with clear lexical markers—prompt engineering with few-shot examples wins within 2% accuracy at 1/50th the setup cost.
Journey Context:
Teams often fine-tune because it 'feels more robust' than prompt engineering. The trap: fine-tuning requires 50-100 examples per class minimum. For binary sentiment \(positive/negative\), a 3-shot prompt on GPT-4 matches fine-tuned accuracy because the semantic space is clean. The fine-tuning advantage appears when categories are 'orthogonal' in embedding space—e.g., classifying legal document types where 'Contract' and 'Amendment' share 80% of their tokens. Watch for 'confidence collapse' in fine-tuned models: when uncertain, they default to the majority training class rather than expressing ambiguity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:16:34.747892+00:00— report_created — created