Report #30926
[cost_intel] When does fine-tuning GPT-4o-mini beat GPT-4o with 5-shot prompting for structured extraction?
Fine-tune GPT-4o-mini when: (1) more than 500 training examples exist, (2) the output schema is fixed and complex (more than 10 nested fields), and (3) latency requirements are strict (under 500 ms). Break-even is roughly 1,000 requests/day. Use GPT-4o few-shot prompting when the schema varies or fewer than 100 examples are available.
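The thresholds above can be written down as a small decision helper. This is a minimal sketch, not a definitive rule: the `Workload` type and `recommend` function are hypothetical names, and it assumes the three conditions plus the ~1,000 requests/day break-even must all hold before fine-tuning is recommended. The report does not say what to do between 100 and 500 examples, so the sketch returns a "borderline" answer there.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an extraction workload."""
    training_examples: int
    schema_is_fixed: bool
    nested_fields: int
    latency_budget_ms: float
    requests_per_day: int


def recommend(w: Workload) -> str:
    # Variable schema or very little data: the report points to few-shot GPT-4o.
    if not w.schema_is_fixed or w.training_examples < 100:
        return "few-shot GPT-4o"

    # Assumption: all report thresholds must hold together.
    fine_tune_ok = (
        w.training_examples > 500
        and w.nested_fields > 10
        and w.latency_budget_ms < 500
        and w.requests_per_day >= 1000
    )
    if fine_tune_ok:
        return "fine-tune GPT-4o-mini"

    # The report is silent on the middle ground, so flag it instead of guessing.
    return "borderline: measure both"
```

For example, `recommend(Workload(800, True, 14, 400, 5000))` returns `"fine-tune GPT-4o-mini"`, while the same workload with only 50 examples returns `"few-shot GPT-4o"`.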
Journey Context:
Common mistake: fine-tuning on 50 examples because 'quality matters'. With that little data the model overfits and performs worse than 5-shot GPT-4o (see 'The False Promise of Imitation Learning'). The other extreme: using a frontier model with 10-shot prompts for everything, at roughly 10x the cost. Right call: treat fine-tuning as a scale decision. It wins on throughput and cost only when request volume justifies the training cost and there is enough data to avoid overfitting (>500 examples).
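The "volume justifies the training cost" argument reduces to simple arithmetic: the one-off training cost divided by the daily saving per request. The sketch below illustrates that calculation only. The function names are hypothetical, and every price, token count, and training cost passed in is an assumption to be replaced with current provider pricing and measured token usage.

```python
from typing import Optional


def cost_per_request(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request in USD, given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def break_even_days(training_cost_usd: float, requests_per_day: int,
                    few_shot_cost: float, fine_tuned_cost: float) -> Optional[float]:
    """Days until the one-off training cost is recovered.

    Returns None when the fine-tuned model is not cheaper per request,
    in which case the training cost is never recovered.
    """
    daily_saving = requests_per_day * (few_shot_cost - fine_tuned_cost)
    if daily_saving <= 0:
        return None
    return training_cost_usd / daily_saving
```

Few-shot prompts carry the examples in every request, so `input_tokens` for the few-shot path is typically much larger than for the fine-tuned path; that difference, multiplied by volume, is what the training cost has to be weighed against.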
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:17:50.170175+00:00— report_created — created