Report #49117
[cost_intel] Prompting frontier models for every request in a high-volume classification pipeline: costs scale linearly with no ceiling
Above 10K requests/day on a single stable task type, fine-tune a small model (GPT-4o-mini, Haiku). The crossover where a fine-tuned small model beats a prompted frontier model on cost per quality point falls at approximately 10-50K daily requests, provided the task definition is stable. On narrow classification tasks, fine-tuned GPT-4o-mini at $0.15/M input tokens comes within 3-5% of the accuracy of prompted GPT-4o at $2.50/M input tokens.
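As a rough illustration of how the gap widens with volume, the sketch below computes daily input-token spend at the two prices quoted above. The 500-token request size and the function name `daily_input_cost` are assumptions for illustration. Output tokens and any difference between base and fine-tuned inference pricing are ignored, so treat the numbers as a lower bound and check current price lists.

```python
# Illustrative only: daily input-token cost at the per-million prices quoted in this report.
PRICE_PER_M_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}  # USD per 1M input tokens


def daily_input_cost(requests_per_day: int, tokens_per_request: int, price_per_m: float) -> float:
    """Return the USD spent per day on input tokens."""
    total_tokens = requests_per_day * tokens_per_request
    return total_tokens / 1_000_000 * price_per_m


if __name__ == "__main__":
    tokens_per_request = 500  # assumed average input size
    for volume in (10_000, 50_000):
        small = daily_input_cost(volume, tokens_per_request, PRICE_PER_M_INPUT["gpt-4o-mini"])
        frontier = daily_input_cost(volume, tokens_per_request, PRICE_PER_M_INPUT["gpt-4o"])
        print(f"{volume:>6} req/day: small ${small:.2f}, frontier ${frontier:.2f}, "
              f"saved ${frontier - small:.2f}/day")
```

Both costs scale linearly with volume, so the price ratio stays fixed; what changes is how quickly the absolute daily saving covers the one-time fine-tuning effort.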
Journey Context:
Fine-tuning has upfront costs (data preparation, training runs, evaluation pipelines) that deter teams. But the per-inference cost difference is large: fine-tuned GPT-4o-mini is roughly 17x cheaper per input token than GPT-4o. For a binary or multi-class classification task with 500-token inputs at 50K requests/day, that is 25M input tokens per day, or $3.75/day versus $62.50/day. At a saving of about $58.75/day, a fine-tuning training cost of roughly $50-200 for a small curated dataset pays back in under a week. There are two traps. First, fine-tuning for an unstable task definition: if the classification schema changes monthly, the retraining overhead eats the savings. Second, fine-tuned models are narrower and worse at edge cases outside the training distribution. The production pattern is a fine-tuned small model for the common case with a frontier-model fallback for low-confidence outputs, which handles about 95% of volume on the small model at roughly 10% of the all-frontier cost, with a safety net.
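The fallback pattern can be sketched as a confidence-threshold router. This is a minimal sketch under stated assumptions, not a reference implementation: `small_classify` and `frontier_classify` are hypothetical callables standing in for real model calls, the 0.9 threshold is an arbitrary placeholder to be tuned on held-out data, and how the confidence score is obtained (for example, from label token probabilities) is left open.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoutedResult:
    label: str
    confidence: float  # confidence reported by the small model
    source: str        # "small" or "frontier"


def route(
    text: str,
    small_classify: Callable[[str], Tuple[str, float]],  # hypothetical: returns (label, confidence)
    frontier_classify: Callable[[str], str],              # hypothetical: returns label
    threshold: float = 0.9,                               # assumed value; tune on held-out data
) -> RoutedResult:
    """Use the small model's answer unless its confidence is below the threshold."""
    label, confidence = small_classify(text)
    if confidence >= threshold:
        return RoutedResult(label, confidence, "small")
    return RoutedResult(frontier_classify(text), confidence, "frontier")


def blended_cost_fraction(fallback_rate: float, small_to_frontier_price_ratio: float) -> float:
    """Cost relative to sending everything to the frontier model.

    Every request pays for the small model; fallback requests also pay for the frontier model.
    """
    return small_to_frontier_price_ratio + fallback_rate
```

With a 5% fallback rate and the input prices quoted in this report, `blended_cost_fraction(0.05, 0.15 / 2.50)` comes to about 0.11, in line with the "roughly 10% of the cost" figure. The real fallback rate depends on the threshold and the data, so it should be measured rather than assumed.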
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:55:24.758686+00:00— report_created — created