Report #66842
[cost_intel] Fine-tuning ROI vs few-shot prompting for repetitive structured extraction tasks
Fine-tune GPT-4o-mini (or an open-source model such as Llama-3.1-8B) when you have more than 10k labeled examples of a fixed-schema extraction task (e.g., parsing invoices, extracting entities from support tickets) and query volume exceeds 5k requests/day. Break-even occurs at roughly 30 days: the one-time training cost ($200-500) plus fine-tuned inference ($0.60/1M tokens) amortizes against GPT-4o few-shot costs ($5/1M tokens, plus about 2k tokens of few-shot context overhead per request). Quality: the fine-tuned mini model comes close to GPT-4o few-shot on F1 (~0.91 vs ~0.93) while cutting latency by about 60% and cost by about 80% at scale. For schemas that change monthly, prefer few-shot; for stable extraction pipelines (e.g., PDF invoice parsing), fine-tune.
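The break-even arithmetic can be sketched as below. This is a minimal illustration, not a pricing tool: the per-token prices, the 2k-token few-shot overhead, the training cost, and the 5k requests/day volume come from the figures above, while the 500-token payload per request and the names `CostModel` and `break_even_days` are assumptions made for the example. It counts only token spend and the one-time training fee; labeling, evaluation, and hosting costs are not modeled, so the result need not match the ~30-day figure.

```python
import math
from dataclasses import dataclass


@dataclass
class CostModel:
    """Per-request token cost for one deployment option."""
    price_per_mtok: float    # USD per 1M tokens
    tokens_per_request: int  # tokens billed per request (prompt + payload)

    def cost_per_request(self) -> float:
        return self.price_per_mtok * self.tokens_per_request / 1_000_000


def break_even_days(training_cost: float, requests_per_day: int,
                    few_shot: CostModel, fine_tuned: CostModel) -> float:
    """Days until the one-time training cost is repaid by inference savings."""
    daily_saving = requests_per_day * (
        few_shot.cost_per_request() - fine_tuned.cost_per_request()
    )
    if daily_saving <= 0:
        return math.inf  # fine-tuning never pays back under these inputs
    return training_cost / daily_saving


if __name__ == "__main__":
    PAYLOAD_TOKENS = 500  # ASSUMPTION: raw input size, not given in the report
    few_shot = CostModel(price_per_mtok=5.00,
                         tokens_per_request=2_000 + PAYLOAD_TOKENS)
    fine_tuned = CostModel(price_per_mtok=0.60,
                           tokens_per_request=PAYLOAD_TOKENS)
    for training_cost in (200, 500):
        days = break_even_days(training_cost, 5_000, few_shot, fine_tuned)
        print(f"training ${training_cost}: break-even after {days:.1f} days")
```

Varying `PAYLOAD_TOKENS` and `requests_per_day` shows how sensitive the payback period is to prompt size and volume, which is the main reason to rerun the numbers for your own workload.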
Journey Context:
Developers default to few-shot prompting with large models for reliability, but for high-volume, schema-stable tasks this wastes money. The hidden cost is prompt length: few-shot examples bloat the context (1k-2k tokens per request), so you pay for input tokens that add no new information once the pattern has been learned. Fine-tuning bakes the pattern into the weights, shrinking the prompt to just the raw input. The risk is distribution shift: if invoice formats change, the fine-tuned model degrades silently, whereas a few-shot prompt adapts as soon as you update its examples. Monitor drift with a golden eval set; if F1 drops by more than 2%, retrain or switch to few-shot. At 1M requests/month, monthly spend is roughly $4k fine-tuned vs $20k few-shot.
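The golden-set drift check described above could look roughly like the following sketch. Everything here is illustrative: `field_f1`, `check_drift`, and the `extract` callable are hypothetical names, F1 is computed as micro-averaged exact match over (field, value) pairs, and the 2% threshold is interpreted as an absolute drop of 0.02 F1 points, which is one possible reading of the rule.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # extracted fields, e.g. {"invoice_no": "A-17"}


def field_f1(predictions: List[Record], golds: List[Record]) -> float:
    """Micro-averaged F1 over exact (field, value) matches."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, golds):
        pred_items = set(pred.items())
        gold_items = set(gold.items())
        tp += len(pred_items & gold_items)   # correct field values
        fp += len(pred_items - gold_items)   # wrong or spurious fields
        fn += len(gold_items - pred_items)   # missed or wrong gold fields
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def check_drift(extract: Callable[[str], Record],
                golden_set: List[Tuple[str, Record]],
                baseline_f1: float,
                max_drop: float = 0.02) -> Tuple[float, bool]:
    """Run the extractor on the golden set; flag drift if F1 fell too far.

    ASSUMPTION: max_drop is an absolute difference in F1 points.
    """
    predictions = [extract(text) for text, _ in golden_set]
    golds = [gold for _, gold in golden_set]
    current = field_f1(predictions, golds)
    return current, (baseline_f1 - current) > max_drop
```

In use, `extract` would wrap the call to the fine-tuned model, and a `True` flag would trigger the retrain-or-fall-back decision. Refreshing the golden set with recent documents matters as much as the threshold, since a stale set cannot reveal new formats.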
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:40:33.630860+00:00 - report_created - created