Report #43818
[cost_intel] When does fine-tuning beat prompting for structured extraction
Fine-tune GPT-4o-mini on 500 to 1,000 examples when the task is fixed-schema extraction, such as invoice processing or receipt parsing, and daily volume exceeds 10,000 requests. Fine-tuned mini achieves 94% F1 versus 89% for zero-shot GPT-4o at one-twentieth the output-token cost ($0.30 versus $6.00 per 1M output tokens). With training costs of $30 to $50 and roughly 500 output tokens per request, break-even occurs at approximately 10,500 to 17,500 requests (about 14,000 at a $40 training cost), which is under two days at 10,000 requests per day. Do not fine-tune if the input distribution shifts frequently (for example, changing document layouts) or if the schema evolves often; use few-shot prompting with dynamic examples instead.
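The break-even arithmetic can be sketched as below. This is a minimal illustration, not a pricing tool: the function names are hypothetical, the prices are the ones quoted in this report, and the 500 output tokens per request is the average assumed in the Journey Context. Only output-token savings are counted; input-token costs are ignored.

```python
import math


def saving_per_request(base_price_per_1m: float,
                       tuned_price_per_1m: float,
                       output_tokens_per_request: int) -> float:
    """USD saved per request, counting output tokens only (assumption)."""
    price_gap = base_price_per_1m - tuned_price_per_1m
    return price_gap * output_tokens_per_request / 1_000_000


def break_even_requests(training_cost_usd: float,
                        base_price_per_1m: float,
                        tuned_price_per_1m: float,
                        output_tokens_per_request: int) -> int:
    """Number of requests needed before savings cover the training cost."""
    per_request = saving_per_request(
        base_price_per_1m, tuned_price_per_1m, output_tokens_per_request
    )
    if per_request <= 0:
        raise ValueError("fine-tuned model is not cheaper per request")
    return math.ceil(training_cost_usd / per_request)


# Figures quoted in this report; treat them as assumptions, not live prices.
BASE_PRICE = 6.00    # USD per 1M output tokens, zero-shot large model
TUNED_PRICE = 0.30   # USD per 1M output tokens, fine-tuned small model
TOKENS = 500         # average output tokens per request
DAILY_REQUESTS = 10_000

daily_saving = DAILY_REQUESTS * saving_per_request(BASE_PRICE, TUNED_PRICE, TOKENS)
requests_needed = break_even_requests(40.0, BASE_PRICE, TUNED_PRICE, TOKENS)
days_needed = 40.0 / daily_saving

print(f"daily saving: ${daily_saving:.2f}")
print(f"break-even after {requests_needed} requests ({days_needed:.1f} days)")
```

Under these assumptions a $40 training run is recovered after roughly 14,000 requests. The result scales inversely with output tokens per request, so short outputs push break-even much later.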
Journey Context:
Teams often assume frontier models are always superior, but on narrow tasks fine-tuned small models outperform zero-shot large models because they hallucinate fewer required keys and return fewer erroneous nulls. The cost arithmetic: training on 1,000 examples with GPT-4o-mini costs approximately $40, and inference savings are $5.70 per 1M output tokens ($6.00 minus $0.30). At 10,000 requests per day averaging 500 output tokens each, volume is 5 million output tokens per day, a saving of $28.50 daily, so the training cost is recovered in under two days. Common errors include fine-tuning on fewer than 100 examples, which causes overfitting, and failing to validate that the fine-tuned model actually outperforms few-shot prompting with examples included in the context window.
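The validation step can be sketched as a field-level F1 comparison on a held-out set. This is one possible scoring scheme, not necessarily the one behind the figures in this report: the function name, the sample records, and the rule that a missing or null field counts as a miss are all assumptions, and reference records are assumed to contain only non-null fields.

```python
from typing import Any


def field_f1(predictions: list[dict[str, Any]],
             references: list[dict[str, Any]]) -> float:
    """Micro-averaged F1 over extracted fields (hypothetical scoring rule)."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key, ref_value in ref.items():
            pred_value = pred.get(key)
            if pred_value is None:
                fn += 1          # required key missing or returned as null
            elif pred_value == ref_value:
                tp += 1          # exact match
            else:
                fp += 1          # wrong value emitted
                fn += 1          # correct value not recovered
        # Non-null keys outside the schema count as hallucinated fields.
        fp += sum(1 for k, v in pred.items() if k not in ref and v is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative held-out data; real validation needs a much larger sample.
references = [{"vendor": "Acme", "total": "12.50"}]
fine_tuned_output = [{"vendor": "Acme", "total": "12.50"}]
few_shot_output = [{"vendor": None, "total": "12.50"}]

ft_score = field_f1(fine_tuned_output, references)
fs_score = field_f1(few_shot_output, references)
print(f"fine-tuned F1={ft_score:.3f}, few-shot F1={fs_score:.3f}")
```

Running both candidates against the same held-out documents, with the few-shot prompt including its in-context examples, gives a like-for-like comparison before committing to the fine-tuned model.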
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:01:10.046357+00:00— report_created — created