Report #71731
[cost_intel] Fine-tuning vs prompting: when does fine-tuning actually beat prompting on cost per quality point?
Fine-tune a small model when: (1) the task format has been stable for weeks or longer, (2) you have 500+ quality examples, (3) volume exceeds 10K calls/month, and (4) your current prompt exceeds 1000 tokens. Inference cost drops 10-50x because learned weights replace the long system prompt and the smaller model is cheaper per token.
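The four criteria above can be read as a simple all-or-nothing checklist. The sketch below is illustrative only: the `Workload` type, its field names, and the reading of "weeks+" as "at least 2 weeks" are assumptions, not part of the report.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a task; field names are illustrative."""
    format_stable_weeks: int   # how long the task format has gone unchanged
    quality_examples: int      # curated training examples available
    calls_per_month: int       # request volume
    prompt_tokens: int         # length of the current system prompt


def meets_fine_tuning_criteria(w: Workload) -> bool:
    """Return True only when all four criteria from the report hold."""
    return (
        w.format_stable_weeks >= 2      # assumption: "weeks+" read as >= 2 weeks
        and w.quality_examples >= 500   # criterion (2)
        and w.calls_per_month > 10_000  # criterion (3)
        and w.prompt_tokens > 1_000     # criterion (4)
    )


# Example: a high-volume, long-prompt task with plenty of training data.
candidate = Workload(format_stable_weeks=4, quality_examples=2000,
                     calls_per_month=500_000, prompt_tokens=2000)
print(meets_fine_tuning_criteria(candidate))
```

Treating the criteria as a conjunction is a deliberate choice here: failing any one of them (for example, too few examples) is taken as a reason to keep prompting.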
Journey Context:
A 2000-token system prompt on GPT-4o ($2.50/M input) costs $0.005/call for the prompt alone. At 500K calls/month that is $2,500 in prompt tokens. Fine-tuned GPT-4o-mini with a 100-token prompt, priced here at $0.15/M input, costs $0.000015/call, or $7.50/month (fine-tuned inference is often billed above the base-model rate, so check current pricing). Training on 2000 examples at ~1K tokens each costs roughly $10-30, so payback is immediate.

The non-obvious tradeoff: fine-tuned models are rigid. To change the output format, add a category, or adjust behavior, you must retrain; there is no instant prompt tweak. Fine-tuning also ties you to a specific model snapshot: if the provider updates or retires the base model, you may need to retrain, and the new fine-tune may behave differently. Prompting wins when iteration speed matters more than per-call cost.

The decision framework: if monthly prompt-token cost exceeds $200 and your task format is stable, fine-tune.
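The arithmetic and the $200 rule above can be reproduced with a short calculation. This is a minimal sketch using the report's own figures; the function names are hypothetical, the prices are the ones quoted in the text rather than verified current rates, and only input (prompt) tokens are counted.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int,
                        usd_per_million_input: float) -> float:
    """Monthly spend on prompt (input) tokens only, in USD."""
    return prompt_tokens * calls_per_month * usd_per_million_input / 1_000_000


def recommend_fine_tuning(monthly_cost_usd: float, format_is_stable: bool,
                          threshold_usd: float = 200.0) -> bool:
    """The report's rule: fine-tune if prompt spend > $200/month and format is stable."""
    return format_is_stable and monthly_cost_usd > threshold_usd


# Figures quoted in the report (assumed, not checked against live pricing).
prompted = monthly_prompt_cost(2000, 500_000, 2.50)    # long prompt on GPT-4o
fine_tuned = monthly_prompt_cost(100, 500_000, 0.15)   # short prompt on the fine-tune

training_cost = 30.0  # upper end of the report's $10-30 estimate
payback_months = training_cost / (prompted - fine_tuned)

print(f"prompted:   ${prompted:,.2f}/month")
print(f"fine-tuned: ${fine_tuned:,.2f}/month")
print(f"payback:    {payback_months:.3f} months")
print("fine-tune?", recommend_fine_tuning(prompted, format_is_stable=True))
```

With these inputs the training cost is recovered in a small fraction of a month, which is what "payback is immediate" refers to. Swapping in your own token counts, volume, and current prices is the point of keeping the rates as parameters.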
Lifecycle
2026-06-21T02:58:48.918020+00:00— report_created — created