Report #43915
[cost_intel] When fine-tuning beats prompting on cost per quality point
Fine-tune GPT-3.5-Turbo when you have >10k labeled examples and the task requires a consistent output format; break-even is at ~1M tokens/day versus GPT-4, with a 5x cost reduction and a 2x latency improvement.
Journey Context:
Many assume GPT-4 with few-shot prompting always wins. However, for narrow tasks (classification, entity extraction, structured generation), a fine-tuned small model achieves 95% of GPT-4 accuracy at 20% of the cost and 2x the speed. The hidden cost is data: you need 10k+ high-quality examples. Calculation: GPT-4 costs $30/1M tokens; fine-tuned GPT-3.5-Turbo costs $6/1M tokens for inference plus $0.80/1M training tokens. At 1M tokens/day in production, that saves $24/day. The training fee for 10M tokens is only $8, so the 30-day payback implies about $720 of upfront cost, presumably dominated by collecting and labeling the data. After payback, inference stays 5x cheaper. Critical: fine-tuning fixes format adherence but not reasoning; use it only when the task is pattern-matching, not logic.
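The break-even arithmetic can be sketched in a few lines of Python. This is a minimal sketch using only the prices quoted in this report; the function names are hypothetical, and the upfront data cost is left as a parameter because the report does not state it. `implied_upfront` simply back-solves the upfront spend that a given payback period would imply.

```python
# Prices quoted in this report, in USD per 1M tokens (not live pricing).
GPT4_PER_M = 30.0    # GPT-4 inference
TUNED_PER_M = 6.0    # fine-tuned GPT-3.5-Turbo inference
TRAIN_PER_M = 0.80   # fine-tuning, per 1M training tokens


def daily_savings(tokens_per_day_m: float, baseline_per_m: float, tuned_per_m: float) -> float:
    """USD saved per day by serving traffic from the tuned model."""
    return tokens_per_day_m * (baseline_per_m - tuned_per_m)


def payback_days(upfront_cost: float, savings_per_day: float) -> float:
    """Days until cumulative savings cover the upfront cost."""
    if savings_per_day <= 0:
        return float("inf")  # the tuned model never pays for itself
    return upfront_cost / savings_per_day


def implied_upfront(days: float, savings_per_day: float) -> float:
    """Upfront cost that would produce the given payback period."""
    return days * savings_per_day


# Report scenario: 1M tokens/day in production, 10M training tokens.
training_fee = 10 * TRAIN_PER_M                          # training fee only
savings = daily_savings(1.0, GPT4_PER_M, TUNED_PER_M)    # USD per day

print(f"training fee: ${training_fee:.2f}")
print(f"savings/day:  ${savings:.2f}")
print(f"payback on training fee alone: {payback_days(training_fee, savings):.2f} days")
print(f"upfront cost implied by a 30-day payback: ${implied_upfront(30, savings):.2f}")
```

To adapt it, pass your own estimate of labeling and engineering cost plus the training fee to `payback_days`; with low daily volume the savings term shrinks and the payback period grows proportionally.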
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:11:03.710056+00:00— report_created — created