Report #39158
[cost_intel] Defaulting to GPT-4o few-shot prompting for high-volume binary classification instead of fine-tuning
For binary classification with more than 500 training examples and a stable label distribution, fine-tune GPT-4o-mini instead of few-shot prompting GPT-4o. Fine-tuned 4o-mini reaches 94% accuracy against 98% for few-shot 4o, a 4-point loss, but costs $0.60/1M tokens instead of $10/1M tokens (roughly 17x cheaper). At 100k classifications/day, and assuming about 1,000 tokens per classification (the volume these totals imply), daily cost drops from $1,000 to $60.
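The daily cost figures can be reproduced with a short calculation. This is a minimal sketch: the prices and request volume are the ones quoted in this report, while `TOKENS_PER_CLASSIFICATION` is an assumption back-derived from the $1,000/day total and is not stated in the report. The function and constant names are illustrative only.

```python
# Figures quoted in this report.
PRICE_4O_PER_M = 10.00           # USD per 1M tokens, GPT-4o few-shot
PRICE_MINI_FT_PER_M = 0.60       # USD per 1M tokens, fine-tuned GPT-4o-mini
CLASSIFICATIONS_PER_DAY = 100_000

# Assumption: implied by $1,000/day at $10/1M tokens, not stated in the report.
TOKENS_PER_CLASSIFICATION = 1_000


def daily_cost(price_per_million: float, requests: int, tokens_per_request: int) -> float:
    """Return the daily spend in USD for a given per-million-token price."""
    total_tokens = requests * tokens_per_request
    return price_per_million * total_tokens / 1_000_000


few_shot_cost = daily_cost(PRICE_4O_PER_M, CLASSIFICATIONS_PER_DAY, TOKENS_PER_CLASSIFICATION)
fine_tuned_cost = daily_cost(PRICE_MINI_FT_PER_M, CLASSIFICATIONS_PER_DAY, TOKENS_PER_CLASSIFICATION)
cost_ratio = few_shot_cost / fine_tuned_cost

print(f"few-shot 4o:        ${few_shot_cost:,.2f}/day")
print(f"fine-tuned 4o-mini: ${fine_tuned_cost:,.2f}/day")
print(f"ratio:              {cost_ratio:.1f}x")
```

The sketch holds token count constant across both setups. In practice the fine-tuned prompt would likely be shorter once the few-shot examples are removed, which would widen the gap; it also ignores any one-off training cost.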
Journey Context:
Teams conflate 'custom task' with 'must use big model few-shot' rather than 'should fine-tune small model.' Fine-tuning embeds the classification boundary into the weights, eliminating the need for 10+ few-shot examples in context (token bloat). The quality cliff: a distribution shift of more than 20% between training and inference data causes catastrophic accuracy drops (to below 70%), because the fine-tuned model lacks the base model's broad few-shot capability. Monitor label drift.
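One way to monitor label drift is to compare the label distribution seen at training time with the distribution of recent predictions. The sketch below is an assumption-laden illustration, not the report's method: the report does not say how the 20% shift is measured, so total variation distance is used here as one plausible reading, and all function names and sample data are hypothetical.

```python
from collections import Counter

# Assumption: the report's ">20%" shift is read as total variation distance > 0.20.
DRIFT_THRESHOLD = 0.20


def label_distribution(labels):
    """Map each label to its relative frequency in the given sequence."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in all_labels)


def check_label_drift(train_labels, live_labels, threshold=DRIFT_THRESHOLD):
    """Return (drifted, distance) comparing training labels with recent predictions."""
    distance = total_variation_distance(
        label_distribution(train_labels), label_distribution(live_labels)
    )
    return distance > threshold, distance


# Hypothetical data: 30% positive at training time, 55% positive in production.
train_labels = ["pos"] * 300 + ["neg"] * 700
live_labels = ["pos"] * 550 + ["neg"] * 450

drifted, distance = check_label_drift(train_labels, live_labels)
print(f"drifted={drifted}, distance={distance:.2f}")
```

Predicted labels are only a proxy for the true label distribution, so a check like this is best paired with periodic accuracy measurement on a freshly labeled sample.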
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:12:06.863040+00:00— report_created — created