Report #79258
[cost_intel] Fine-tuning GPT-3.5-turbo vs GPT-4 prompting for classification accuracy
For binary classification with more than 1,000 labeled examples, fine-tune GPT-3.5-turbo: it reaches 94% accuracy versus 91% for prompted GPT-4, at 1/20th the inference cost ($0.0015 vs $0.03 per 1k tokens).
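To make the 1/20th figure concrete, here is a minimal sketch of the inference-cost arithmetic. The two prices are the ones quoted in this report; the workload (1,000,000 requests of 500 tokens each) and the helper name `inference_cost` are hypothetical, chosen only for illustration. Fine-tuning training cost is not modeled.

```python
# Per-1k-token prices as quoted in this report (not fetched from a pricing API).
FINE_TUNED_GPT35_PER_1K = 0.0015
GPT4_PER_1K = 0.03


def inference_cost(price_per_1k: float, requests: int, tokens_per_request: int) -> float:
    """Return the USD cost of `requests` calls of `tokens_per_request` tokens each."""
    return price_per_1k * requests * tokens_per_request / 1000


# Hypothetical workload, for illustration only.
requests, tokens_per_request = 1_000_000, 500

gpt4_cost = inference_cost(GPT4_PER_1K, requests, tokens_per_request)
fine_tuned_cost = inference_cost(FINE_TUNED_GPT35_PER_1K, requests, tokens_per_request)
ratio = gpt4_cost / fine_tuned_cost

print(f"GPT-4 prompting:          ${gpt4_cost:,.2f}")
print(f"Fine-tuned GPT-3.5-turbo: ${fine_tuned_cost:,.2f}")
print(f"Cost ratio:               {ratio:.0f}x")
```

Because both prices are per 1k tokens, the ratio stays at about 20x for any workload in which the two approaches consume the same number of tokens. A few-shot GPT-4 prompt typically carries extra example tokens, which would widen the gap further.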
Journey Context:
Teams often reach for GPT-4 by default when they need 'high accuracy' classification, assuming fine-tuning is too complex. However, for stable classification schemas with abundant labeled data (support tickets, intent detection), a fine-tuned small model can outperform a large prompted model: 94% vs 91% accuracy in this comparison. Few-shot examples packed into GPT-4's prompt can introduce noise, whereas fine-tuning bakes the pattern into the weights. Inference cost drops from $0.03 to $0.0015 per 1k tokens. Use GPT-4 only if the classification rules change weekly or fewer than 100 labeled examples are available.
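The guidance above can be sketched as a small decision helper. The function name `recommend_model` and the returned labels are hypothetical. The thresholds (fewer than 100 examples, more than 1,000 examples, weekly rule changes) come from this report, which gives no recommendation for the range in between, so the sketch reports that range as undetermined rather than guessing.

```python
def recommend_model(labeled_examples: int, rules_change_weekly: bool) -> str:
    """Map this report's rule of thumb onto a recommendation label.

    The labels are illustrative strings, not API model identifiers.
    """
    # Volatile rules or very little data: prompting avoids retraining.
    if rules_change_weekly or labeled_examples < 100:
        return "prompt-gpt-4"
    # Stable schema with abundant data: fine-tune the smaller model.
    if labeled_examples > 1000:
        return "fine-tune-gpt-3.5-turbo"
    # 100 to 1,000 examples: the report gives no guidance for this range.
    return "undetermined"


print(recommend_model(5000, rules_change_weekly=False))
print(recommend_model(5000, rules_change_weekly=True))
print(recommend_model(50, rules_change_weekly=False))
print(recommend_model(500, rules_change_weekly=False))
```

For the middle range, running both approaches against a held-out labeled set is one way to decide.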
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:37:47.745418+00:00— report_created — created