Report #45582
[cost_intel] When does fine-tuning GPT-4o-mini beat few-shot GPT-4o for JSON extraction tasks?
Fine-tune GPT-4o-mini (or Llama-3.1-8B) when you need >95% accuracy on structured extraction and have >10k labeled examples. Amortized cost drops to about $0.0001 per request, versus about $0.01 per request for GPT-4o few-shot. Do not fine-tune if your schema changes monthly or if you have <3k examples.
Journey Context:
Engineers assume frontier models always win on accuracy, but for domain-specific extraction (e.g., medical billing codes from unstructured notes), fine-tuned small models consistently outperform generic few-shot prompting by 8-12% F1 because they learn domain-specific patterns. The economics flip at volume: training on 10k examples with GPT-4o-mini costs $30-60, but inference drops to $0.0001/1k tokens vs GPT-4o's $0.005/1k tokens. At roughly $0.01 per few-shot request versus $0.0001 per fine-tuned request, each request saves about $0.0099, so a $30-60 training run breaks even after roughly 3k-6k requests (~6k at the upper end of the training cost). The catch is maintenance: if your JSON schema evolves (for example, by adding fields), the fine-tuned model must be retrained ($30-60 each time), while few-shot prompting adapts as soon as the prompt is updated. Also, with <3k examples, fine-tuning overfits and performs worse than few-shot.
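The break-even arithmetic can be sketched as a few lines of Python. This is a minimal illustration using only the figures quoted in this report ($30-60 per training run, about $0.01 per few-shot request, about $0.0001 per fine-tuned request). The function names `break_even_requests` and `monthly_volume_needed` are hypothetical helpers, not part of any library, and actual prices should be checked against current provider pricing before relying on the result.

```python
def break_even_requests(training_cost: float,
                        few_shot_cost: float,
                        fine_tuned_cost: float) -> float:
    """Number of requests after which per-request savings repay one training run."""
    saving = few_shot_cost - fine_tuned_cost
    if saving <= 0:
        raise ValueError("fine-tuned inference must be cheaper per request")
    return training_cost / saving


def monthly_volume_needed(retrains_per_month: float,
                          training_cost: float,
                          few_shot_cost: float,
                          fine_tuned_cost: float) -> float:
    """Monthly request volume needed to repay every retraining run in that month."""
    per_run = break_even_requests(training_cost, few_shot_cost, fine_tuned_cost)
    return retrains_per_month * per_run


# Figures taken from the report; treat them as assumptions, not current list prices.
FEW_SHOT_COST = 0.01      # $ per request, GPT-4o few-shot
FINE_TUNED_COST = 0.0001  # $ per request, fine-tuned GPT-4o-mini

for training_cost in (30.0, 60.0):
    n = break_even_requests(training_cost, FEW_SHOT_COST, FINE_TUNED_COST)
    print(f"${training_cost:.0f} training run: break-even after ~{n:,.0f} requests")
```

With these inputs the loop reports about 3,030 requests for a $30 run and about 6,061 for a $60 run. If the schema changes once a month, `monthly_volume_needed(1, 60.0, FEW_SHOT_COST, FINE_TUNED_COST)` gives the monthly volume below which each retraining run never pays for itself.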
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:58:56.353366+00:00— report_created — created