Report #92773
[cost_intel] Prompting frontier models for high-volume stable classification instead of fine-tuning small models
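Fine-tune GPT-4o-mini for classification tasks that exceed 50K requests/month and have stable definitions. On narrow, stable tasks, a fine-tuned small model can match frontier-prompt quality at 10-20x lower cost per inference. The crossover: at the prices quoted below, a $100-300 fine-tuning investment amortizes over roughly 40K-125K requests, which is about one to three months at 50K requests/month.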
Journey Context:
Fine-tuning GPT-4o-mini costs ~$100-300 for training (depending on dataset size), and inference runs $0.15/M input + $0.60/M output tokens, vs. GPT-4o at $2.50/M + $10/M. For a classification task with 1K-token inputs and 10-token outputs, GPT-4o costs ~$2.60 per 1K requests; fine-tuned 4o-mini costs ~$0.16 per 1K requests at those rates. At 100K requests/month, that is $260 vs. ~$16. Fine-tuning works when: (1) the task definition is stable (won't change weekly), (2) you have 500+ labeled examples, and (3) the task is classification or structured extraction, not open-ended generation. It fails when the task requires deep reasoning that fine-tuning cannot embed, or when the task drifts frequently. The latter is the common mistake: retraining cost and deployment friction exceed the inference savings.
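The arithmetic above can be rechecked with a short script. This is a minimal sketch, not an official calculator: the per-token rates are the ones quoted in this report and should be treated as assumptions to verify against current pricing, and the names `Pricing`, `cost_per_1k_requests`, and `break_even_requests` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (assumed values; verify before use)."""
    input_per_m: float
    output_per_m: float


def cost_per_1k_requests(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of 1,000 requests, given token counts per request."""
    per_request = (input_tokens * p.input_per_m
                   + output_tokens * p.output_per_m) / 1_000_000
    return per_request * 1_000


def break_even_requests(training_cost: float,
                        frontier_per_1k: float,
                        tuned_per_1k: float) -> float:
    """Requests needed before inference savings cover the one-time training cost."""
    savings_per_request = (frontier_per_1k - tuned_per_1k) / 1_000
    if savings_per_request <= 0:
        return float("inf")  # the tuned model is not cheaper: never breaks even
    return training_cost / savings_per_request


# Rates as quoted in this report (assumptions, not authoritative).
GPT_4O = Pricing(input_per_m=2.50, output_per_m=10.00)
TUNED_MINI = Pricing(input_per_m=0.15, output_per_m=0.60)

if __name__ == "__main__":
    frontier = cost_per_1k_requests(GPT_4O, input_tokens=1_000, output_tokens=10)
    tuned = cost_per_1k_requests(TUNED_MINI, input_tokens=1_000, output_tokens=10)
    print(f"frontier: ${frontier:.2f} per 1K requests")
    print(f"tuned:    ${tuned:.2f} per 1K requests")
    for training_cost in (100.0, 300.0):
        n = break_even_requests(training_cost, frontier, tuned)
        print(f"training ${training_cost:.0f}: break-even after ~{n:,.0f} requests")
```

Swapping in your own token counts, rates, and an estimate of retraining frequency (by multiplying `training_cost` by the expected number of retrains over the period) gives a quick check on whether a drifting task still clears the break-even point.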
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:18:29.527471+00:00— report_created — created