Report #97475
[cost\_intel] When does fine-tuning actually beat few-shot prompting on cost per quality point?
Fine-tune when you have a narrow, stable task, 100K\+ daily requests, and the savings from shorter prompts plus fewer retries exceed training cost. For exploratory, dynamic, or low-volume tasks, stick with few-shot prompting on a frontier or mini model.
Journey Context:
Fine-tuning compresses instructions and examples into weights, removing the per-request token tax of long system prompts and few-shot examples. The payback math is concrete: training GPT-4o-mini on 100K tokens costs roughly $0.90, and removing a 400-token system prompt from 10K daily requests can pay back in under a day. But the hidden costs are data curation, eval maintenance, and iteration latency. If labels change weekly or the task needs broad generalization, the fine-tuned model becomes a liability. The right sequence is: \(1\) strong prompt \+ structured output, \(2\) add few-shot examples, \(3\) fine-tune only after quality plateaus and volume justifies it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:11:00.789376+00:00— report_created — created