Report #53991
[cost_intel] When does fine-tuning GPT-3.5-Turbo beat GPT-4 prompting on cost per quality point?
Fine-tuning GPT-3.5-Turbo breaks even at >10K requests/day for classification/extraction tasks with <500-token outputs. GPT-4 prompting wins for complex reasoning, varied output formats, or volume <1K requests/day, because fine-tuning adds an upfront training cost ($0.008/1K training tokens) and fine-tuned inference runs at $3/1M tokens (a premium over base GPT-3.5), compared with $30/1M for GPT-4.
Journey Context:
Teams default to GPT-4 for reliability, but a fine-tuned GPT-3.5 can match its quality on narrow tasks at 10x lower inference cost. The economics are subtle, though. Fine-tuning costs $0.008 per 1K training tokens (so 800K training tokens = $6.40), and the fine-tuned model then runs at 6x the base inference cost ($3/1M vs $0.50/1M for base 3.5). Break-even math: if GPT-4 costs $30/1M tokens and fine-tuned 3.5 costs $3/1M, you save $27/1M. If training cost $640 (80K examples at roughly 1K tokens each, i.e. 80M training tokens), you need to process about 24M tokens to break even ($640 / $27 per 1M). At 500 tokens/request, that's about 48K requests. Below this total volume, GPT-4 is cheaper AND higher quality. At 10K requests/day the training cost is recovered in about 5 days; at 1K requests/day it takes about 48 days. Additionally, fine-tuning fails on tasks requiring broad world knowledge or reasoning; it only works for style/format/classification tasks.
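The break-even arithmetic above can be sketched as a small calculator. This is a minimal sketch, not an official pricing tool: the prices are the figures quoted in this report (check current pricing before relying on them), the function names are hypothetical, and the 1K tokens per training example is an assumption inferred from the $640 / 80K examples figure.

```python
import math

# Prices quoted in this report (USD); treat them as assumptions, not live pricing.
TRAIN_PRICE_PER_1K = 0.008   # fine-tuning cost per 1K training tokens
GPT4_PRICE_PER_1M = 30.0     # GPT-4 inference cost per 1M tokens
FT_PRICE_PER_1M = 3.0        # fine-tuned GPT-3.5 inference cost per 1M tokens


def training_cost(num_examples: int, tokens_per_example: int) -> float:
    """Upfront fine-tuning cost in USD for a training set of the given size."""
    total_tokens = num_examples * tokens_per_example
    return total_tokens / 1_000 * TRAIN_PRICE_PER_1K


def break_even_tokens(train_cost: float) -> float:
    """Inference tokens needed before per-token savings repay the training cost."""
    savings_per_token = (GPT4_PRICE_PER_1M - FT_PRICE_PER_1M) / 1_000_000
    return train_cost / savings_per_token


def break_even_requests(train_cost: float, tokens_per_request: int) -> int:
    """Number of requests (rounded up) needed to reach break-even."""
    return math.ceil(break_even_tokens(train_cost) / tokens_per_request)


def days_to_break_even(train_cost: float, tokens_per_request: int,
                       requests_per_day: int) -> float:
    """Days of traffic needed to recover the training cost at a steady volume."""
    return break_even_requests(train_cost, tokens_per_request) / requests_per_day


if __name__ == "__main__":
    # Assumption: 80K examples at ~1K tokens each, matching the $640 figure.
    cost = training_cost(80_000, 1_000)
    print(f"training cost: ${cost:,.2f}")
    print(f"break-even tokens: {break_even_tokens(cost):,.0f}")
    print(f"break-even requests: {break_even_requests(cost, 500):,}")
    for volume in (1_000, 10_000):
        days = days_to_break_even(cost, 500, volume)
        print(f"{volume:>6} requests/day -> {days:.1f} days to break even")
```

With these inputs the calculator gives roughly 23.7M tokens and 47.4K requests, which the report rounds to 24M and 48K. The calculation covers cost only; it does not model the quality gap between the two models.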
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:07:07.515706+00:00— report_created — created