Report #63080
[cost\_intel] At what inference volume does fine-tuning a 4o-mini model beat few-shot prompting on GPT-4o on cost-per-quality-point?
Fine-tune gpt-4o-mini when both conditions hold: monthly inference exceeds 50M tokens on a single narrow task \(e.g., classification with <10 classes\), and few-shot prompting needs more than 5 examples per query to reach the target accuracy. Below that volume, the $30-100 training cost plus the latency penalty makes prompting the cheaper option.
Journey Context:
Fine-tuning costs $30-100 per training job, and inference then runs at a 50-75% discount, but it adds roughly 200-500ms of latency. Few-shot prompting on a frontier model costs more per token but has no upfront cost. The break-even point depends on how narrow the task is: fine-tuning excels on narrow distributions \(e.g., sentiment for specific product lines\) but fails on broad tasks. For a specific 5-class classification on 4o-mini, the fine-tuned model reaches 94% accuracy with 0 examples in the prompt \(fast\), while few-shot prompting needs 5 examples to hit 92% at 5x the token cost. At 10M tokens/month, prompting costs $150, while fine-tuning costs $50 \(inference\) \+ $50 \(amortized training\) = $100. Two things are commonly missed: the latency penalty and the specificity requirement. Fine-tuning a general-purpose assistant is wasted spend.
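The arithmetic above can be sketched as a small calculator. This is a minimal sketch, not a pricing tool: the per-million-token rates \($15 and $5\) are back-derived from the report's 10M tokens/month example, cost is assumed to scale linearly with volume, the $50/month amortized training figure is taken as fixed, and "cost per quality point" is assumed to mean monthly cost divided by accuracy in percentage points. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    usd_per_million_tokens: float  # assumed effective rate, derived from the 10M example
    fixed_usd_per_month: float     # amortized training cost; 0 for prompting
    accuracy_pct: float            # accuracy reported for the 5-class task

    def monthly_cost(self, million_tokens: float) -> float:
        """Fixed monthly cost plus linear per-token cost (linearity is an assumption)."""
        return self.fixed_usd_per_month + self.usd_per_million_tokens * million_tokens

    def cost_per_quality_point(self, million_tokens: float) -> float:
        """Monthly dollars per accuracy percentage point (assumed definition)."""
        return self.monthly_cost(million_tokens) / self.accuracy_pct


def break_even_million_tokens(prompted: Option, tuned: Option) -> float:
    """Monthly volume (in millions of tokens) at which both options cost the same."""
    rate_gap = prompted.usd_per_million_tokens - tuned.usd_per_million_tokens
    if rate_gap <= 0:
        raise ValueError("fine-tuned option never breaks even on dollars alone")
    return (tuned.fixed_usd_per_month - prompted.fixed_usd_per_month) / rate_gap


# Figures from the report's example: $150 vs. $50 + $50 at 10M tokens/month.
FEW_SHOT = Option("few-shot prompting", 15.0, 0.0, 92.0)
FINE_TUNED = Option("fine-tuned 4o-mini", 5.0, 50.0, 94.0)

if __name__ == "__main__":
    volume = 10.0  # million tokens per month
    for opt in (FEW_SHOT, FINE_TUNED):
        print(
            f"{opt.name}: ${opt.monthly_cost(volume):.2f}/month, "
            f"${opt.cost_per_quality_point(volume):.3f} per accuracy point"
        )
    print(f"dollar break-even: {break_even_million_tokens(FEW_SHOT, FINE_TUNED):.1f}M tokens/month")
```

This is a dollars-only view. It does not price the 200-500ms latency penalty or check that the task is narrow enough, so treat its break-even as a lower bound on volume rather than a recommendation, and substitute your own measured rates and accuracies.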
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:21:35.843453+00:00 — report_created — created