Report #27518
[cost_intel] At what volume does fine-tuning GPT-3.5 beat GPT-4 prompting on cost per quality point?
Fine-tune when daily volume exceeds 50k requests AND the task's accuracy requirement is below the 95th percentile; below that volume, GPT-4 few-shot prompting wins on total cost of ownership.
Journey Context:
A common miscalculation is to compare per-token prices only ($0.0015/1k for fine-tuned 3.5 vs $0.03/1k for GPT-4). Fine-tuning carries costs beyond that headline rate: $0.0080/1k training tokens, a minimum viable dataset of 50k examples for complex tasks, and ongoing inference costs.

Break-even analysis: at 10k requests/day with 2k tokens each (20M tokens/day), GPT-4 costs $600/day. Fine-tuned 3.5 costs $80/day for inference + $200/day in amortized training = $280/day. The $80/day inference figure corresponds to an effective rate of $0.004/1k over 20M tokens; the $0.0015/1k headline rate alone would give $30/day.

The cost gap comes with a quality gap: the fine-tuned model achieves 85% accuracy vs GPT-4's 92% on complex reasoning, so if the task requires >90% accuracy, fine-tuning fails regardless of price. The 50k threshold assumes the task is classification or extraction, where fine-tuned 3.5 can match GPT-4 quality.
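The break-even arithmetic above can be reproduced with a small sketch. It uses only the figures quoted in this report. The function names are hypothetical, and two things are assumptions: inference cost is scaled linearly with request volume, and the $200/day amortized training cost is treated as fixed. "Cost per quality point" is read here as daily cost divided by accuracy percentage, which is one possible interpretation and not a definition given in the report.

```python
# Figures taken from the report's worked example (not live pricing).
TOKENS_PER_REQUEST = 2_000
GPT4_PRICE_PER_1K = 0.03        # USD per 1k tokens
FT_INFERENCE_AT_10K = 80.0      # USD/day at 10k requests/day, as reported
FT_TRAINING_AMORTIZED = 200.0   # USD/day, assumed fixed regardless of volume
GPT4_ACCURACY = 92.0            # percent, complex reasoning
FT_ACCURACY = 85.0              # percent, complex reasoning


def gpt4_daily_cost(requests_per_day: int) -> float:
    """Daily GPT-4 cost: tokens per day (in thousands) times the per-1k price."""
    tokens_in_thousands = requests_per_day * TOKENS_PER_REQUEST / 1_000
    return tokens_in_thousands * GPT4_PRICE_PER_1K


def fine_tuned_daily_cost(requests_per_day: int) -> float:
    """Daily fine-tuned cost: inference scaled linearly from the 10k/day
    figure (assumption) plus the fixed amortized training cost."""
    inference = FT_INFERENCE_AT_10K * requests_per_day / 10_000
    return inference + FT_TRAINING_AMORTIZED


def cost_per_accuracy_point(daily_cost: float, accuracy_pct: float) -> float:
    """One possible reading of 'cost per quality point': USD/day per accuracy point."""
    return daily_cost / accuracy_pct


def fine_tuning_viable(required_accuracy_pct: float) -> bool:
    """Fine-tuning is ruled out when the required accuracy exceeds what the
    fine-tuned model reaches, whatever the cost difference."""
    return FT_ACCURACY >= required_accuracy_pct


if __name__ == "__main__":
    volume = 10_000
    gpt4 = gpt4_daily_cost(volume)
    tuned = fine_tuned_daily_cost(volume)
    print(f"GPT-4: ${gpt4:.2f}/day, fine-tuned 3.5: ${tuned:.2f}/day")
    print(f"Per accuracy point: GPT-4 ${cost_per_accuracy_point(gpt4, GPT4_ACCURACY):.2f}, "
          f"fine-tuned ${cost_per_accuracy_point(tuned, FT_ACCURACY):.2f}")
    print("Viable at a 90% requirement:", fine_tuning_viable(90.0))
```

At 10k requests/day this reproduces the report's $600/day and $280/day totals. The accuracy check is kept separate from the cost comparison because a failed accuracy requirement rules out fine-tuning no matter how the costs compare.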
Lifecycle
2026-06-18T00:35:09.565620+00:00— report_created — created