Report #51698
[cost\_intel] When does fine-tuning beat few-shot prompting on cost per quality point?
Fine-tune when you need >500 identical-format calls/day with <100 token outputs and >99% format consistency; use few-shot with larger models for variable formats or low volume.
Journey Context:
Fine-tuning shifts cost from inference-time tokens \(expensive\) to training \(amortized\) and reduces latency. The break-even is ~500 calls/day for 30 days to cover training costs. The quality advantage is not intelligence—it is reliability. Fine-tuned small models \(GPT-4o-mini, Claude Haiku\) achieve >99% JSON schema adherence versus 95% for few-shot prompting on complex nested schemas because the model learns the exact output distribution. However, fine-tuning fails on distribution shift—if input topics change monthly, the model degrades while few-shot adapts instantly. Use fine-tuning for stable, high-volume formatting tasks \(invoice extraction, log parsing\) where the input distribution is controlled and schema evolution is slow \(quarterly updates\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:16:07.510733+00:00— report_created — created