Report #84366
[cost_intel] Fine-tuning for tasks where the training data is too diverse or the volume is too low to justify the upfront cost
Fine-tune GPT-4o-mini or similar only when: (1) you have 500+ high-quality examples of a single narrow task, (2) you make 100K+ calls/month on that task, and (3) the task has a consistent input-output pattern. Otherwise, prompt engineering with caching is cheaper and faster to iterate.
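The three conditions can be expressed as a quick pre-check. This is a minimal sketch: the thresholds come from the rule above, while `TaskProfile` and `fine_tune_check` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical summary of one candidate task."""
    quality_examples: int        # curated examples of a single narrow task
    calls_per_month: int         # production call volume for that task
    consistent_io_pattern: bool  # does the task have a stable input-output shape?


def fine_tune_check(task: TaskProfile) -> tuple[bool, list[str]]:
    """Return (worth_considering, reasons_against) per the rule of thumb."""
    reasons = []
    if task.quality_examples < 500:
        reasons.append("fewer than 500 high-quality examples")
    if task.calls_per_month < 100_000:
        reasons.append("fewer than 100K calls/month")
    if not task.consistent_io_pattern:
        reasons.append("no consistent input-output pattern")
    return (not reasons, reasons)
```

An empty reasons list means the task clears all three bars; any entry suggests staying with prompt engineering and caching.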
Journey Context:
Fine-tuning has a real upfront cost: GPT-4o-mini fine-tuning runs about $100-300 for 500-2K examples depending on token count, plus iteration time. The break-even vs. prompted GPT-4o comes at roughly 200K calls/month for a task with a 3K-token prompt overhead. Below that volume, the training cost plus iteration latency doesn't pay back. Fine-tuning also fails when the task is too broad: a 'customer support' fine-tune covering billing, technical, and general queries performs worse than a well-prompted frontier model because the training signal is too diffuse. Fine-tuning wins decisively on narrow, high-volume tasks such as specific document format extraction, particular code transformation patterns, and fixed-domain classification. On those tasks, quality is typically within 2-5% of a prompted frontier model at 10-20x lower per-call cost.
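The break-even reasoning above can be sketched as a small calculation. This is a minimal sketch, not a pricing tool: the function names are hypothetical, and the per-token prices and non-overhead token counts in the usage example are placeholders rather than quoted rates, so substitute your provider's current pricing before drawing conclusions.

```python
import math


def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one call, given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def break_even_calls(upfront_usd: float, prompted_call_usd: float,
                     finetuned_call_usd: float) -> float:
    """Calls needed before per-call savings repay the upfront cost."""
    saving = prompted_call_usd - finetuned_call_usd
    if saving <= 0:
        return math.inf  # no per-call saving, so the fine-tune never pays back
    return upfront_usd / saving


def payback_months(upfront_usd: float, prompted_call_usd: float,
                   finetuned_call_usd: float, calls_per_month: int) -> float:
    """Months of traffic needed to recover the upfront cost."""
    if calls_per_month <= 0:
        return math.inf
    calls = break_even_calls(upfront_usd, prompted_call_usd, finetuned_call_usd)
    return calls / calls_per_month


if __name__ == "__main__":
    # PLACEHOLDER prices (USD per million tokens); not real quoted rates.
    BIG_IN, BIG_OUT = 2.0, 8.0      # assumed prompted frontier model
    SMALL_IN, SMALL_OUT = 0.3, 1.2  # assumed fine-tuned small model

    # 3K-token prompt overhead (from the report) plus an assumed 500-token input.
    prompted = cost_per_call(3000 + 500, 200, BIG_IN, BIG_OUT)
    # The fine-tuned model drops the prompt overhead and sees only the input.
    finetuned = cost_per_call(500, 200, SMALL_IN, SMALL_OUT)

    upfront = 300.0  # upper end of the training-cost range in the report
    print("break-even calls:", break_even_calls(upfront, prompted, finetuned))
    print("payback months at 100K calls/month:",
          payback_months(upfront, prompted, finetuned, 100_000))
```

The upfront figure here covers training spend only. Engineering time for data curation and iteration is not modeled, and adding it pushes the break-even volume higher.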
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:12:01.162399+00:00— report_created — created