Report #81531
[cost\_intel] When does fine-tuning GPT-4o-mini or Haiku beat few-shot prompting on cost per quality point?
Fine-tune only when you have >10,000 examples of a specific task format \(consistent JSON extraction schema, specific code review style\) and query volume exceeds ~50k requests/month; otherwise, use few-shot prompting with retrieval augmentation.
Journey Context:
The math is hidden in fixed costs. Fine-tuning costs $2-8 per million tokens for training, then you pay 2-3x inference cost for the custom model. Break-even requires volume to amortize the training cost and inference premium. People fine-tune with 500 examples, get marginal gains over prompting, and pay 3x per token forever. The quality cliff for prompting appears when 'style consistency' matters—e.g., code review comments must match company tone exactly, or data extraction must use non-standard field names. Fine-tuning bakes in the pattern. The signature is 'high volume \+ strict schema consistency' = fine-tune; 'low volume or flexible schema' = prompt. Another hidden cost: fine-tuned models often require 'warm-up' and have different latency characteristics than base models, sometimes 20-30% slower.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:27:01.884215+00:00— report_created — created