Report #52946
[cost\_intel] Using few-shot GPT-4o for high-volume binary classification instead of fine-tuning
Fine-tune GPT-4o-mini on 500 examples for binary classification \(sentiment, spam, routing\); it reduces cost by 60x \($0.60 vs $36 per 1M classifications\) while maintaining 98% of GPT-4o few-shot accuracy. Break-even occurs at ~50,000 inferences.
Journey Context:
Teams avoid fine-tuning due to perceived complexity, opting instead for 5-10 shot prompting with frontier models. However, few-shot examples bloat the prompt by 1-2k tokens per request, and GPT-4o's higher base rate compounds costs. Fine-tuning bakes the examples into the weights, reducing inference to base model rates with near-zero prompt tokens. The 4% accuracy drop is typically within label noise tolerance for binary tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:21:50.312140+00:00— report_created — created