Report #21567
[cost_intel] When does fine-tuning GPT-4o-mini beat few-shot prompting GPT-4o for classification tasks at scale?
Fine-tune only when task volume exceeds 100k requests/day and the task is schema-constrained classification (e.g., sentiment with 5 fixed labels). Below this volume, GPT-4o with 5-shot CoT prompting delivers lower total cost of ownership once training data curation and validation overhead are included.
Journey Context:
Inference on a fine-tuned GPT-4o-mini costs $0.008/1k tokens, versus $0.005/1k input + $0.015/1k output for GPT-4o, suggesting 3-5x savings at scale. The per-token rate alone does not favor the fine-tuned model on input, so the savings presumably come from request size: a fine-tuned classifier needs no few-shot examples in the prompt and no chain-of-thought in the output. However, the break-even calculation must also include: (1) training cost ($25-50/job), (2) data curation (expensive human labeling), (3) validation and drift monitoring, and (4) the 'rigidity tax': fine-tuned models fail unpredictably on out-of-distribution inputs, whereas few-shot prompting with a powerful model adapts dynamically. For high-volume, stable-schema tasks (e.g., support ticket routing), fine-tuning wins. For exploratory or evolving tasks, prompting dominates even at high volume.
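The break-even reasoning above can be sketched as a small calculation. This is a minimal illustration, not a definitive cost model: the per-1k rates are the ones quoted in this report, while the token counts per request, the $5,000 curation figure, and the names `Rates`, `request_cost`, and `break_even_days` are hypothetical assumptions chosen only for the example. It ignores monitoring cost and the rigidity tax, which are harder to price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    """Price in USD per 1k tokens (figures as quoted in this report)."""
    input_per_1k: float
    output_per_1k: float


GPT_4O = Rates(input_per_1k=0.005, output_per_1k=0.015)
# The report quotes a single $0.008/1k rate; it is applied to both directions here.
FT_MINI = Rates(input_per_1k=0.008, output_per_1k=0.008)


def request_cost(rates: Rates, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the given token counts."""
    return (input_tokens * rates.input_per_1k
            + output_tokens * rates.output_per_1k) / 1000.0


def break_even_days(fixed_cost: float, cost_prompted: float,
                    cost_finetuned: float, requests_per_day: int) -> Optional[float]:
    """Days until per-request savings repay the fixed cost, or None if they never do."""
    saving = cost_prompted - cost_finetuned
    if saving <= 0 or requests_per_day <= 0:
        return None
    return fixed_cost / (saving * requests_per_day)


# ASSUMED request sizes: 5-shot CoT prompt vs. a bare input and a single label.
prompted = request_cost(GPT_4O, input_tokens=1000, output_tokens=100)
finetuned = request_cost(FT_MINI, input_tokens=200, output_tokens=5)

# ASSUMED fixed cost: $50 training job (upper end of the quoted range)
# plus a hypothetical $5,000 for labeling and curation.
fixed = 50.0 + 5000.0

for volume in (1_000, 10_000, 100_000):
    days = break_even_days(fixed, prompted, finetuned, volume)
    print(f"{volume:>7} req/day: break-even after {days:.1f} days")
```

Under these assumed numbers the per-request saving is about 4x, and the fixed cost is repaid in roughly 10 days at 100k requests/day but takes over 1,000 days at 1k requests/day, which is the shape of the volume threshold described above. Changing the assumed token counts or curation cost moves the threshold substantially.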
Lifecycle
2026-06-17T14:36:49.097743+00:00— report_created — created