Report #84106
[cost_intel] When does fine-tuning GPT-3.5-turbo beat GPT-4o prompting for structured output?
Fine-tune a smaller model (GPT-3.5-turbo, Llama-3-8B) when all three of the following hold: (1) the output schema has more than 20 strictly validated fields, (2) more than 10k training examples are available, and (3) the latency budget is under 500 ms. Cost per valid output then drops roughly 10x versus prompting a frontier model, but only once training costs are amortized over more than 500k requests.
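As a rough sketch, the conditions above can be written as a checklist. The `Workload` type, its field names, and `should_fine_tune` are hypothetical names introduced for illustration; the thresholds are the ones stated in this report, not universal constants.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a structured-extraction pipeline."""
    strict_fields: int         # schema fields that must pass validation
    training_examples: int     # labeled examples available for fine-tuning
    latency_budget_ms: float   # maximum acceptable latency per request
    expected_requests: int     # projected lifetime request volume


def should_fine_tune(w: Workload) -> bool:
    """Return True only when every condition from the report holds."""
    return (
        w.strict_fields > 20
        and w.training_examples > 10_000
        and w.latency_budget_ms < 500
        and w.expected_requests > 500_000  # amortization threshold
    )
```

For example, `should_fine_tune(Workload(30, 12_000, 400, 1_000_000))` passes every check, while the same workload with only 100,000 expected requests fails the amortization condition.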
Journey Context:
OpenAI's fine-tuning API lets you customize smaller models. For structured extraction (a JSON object with 30 fields), prompting GPT-4o reaches 95% accuracy but costs $0.04 per 1k tokens and takes about 2 s per request. A fine-tuned GPT-3.5-turbo reaches 98% accuracy on the same schema at $0.003 per 1k tokens with 300 ms latency. The catch is the up-front cost: a training run costs $20-50 and requires 10k examples, and the reported break-even is ~500k requests. Each approach also degrades in its own way. Prompted GPT-4o shows 'hallucinated optional fields' or 'schema drift', where it invents fields that are not in the schema; fine-tuned models show 'mode collapse' on rare classes when the training data is imbalanced. Teams often assume 'bigger model = better extraction' without calculating the roughly 10x cost penalty on high-volume pipelines.
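The cost comparison can be checked with a few lines of arithmetic. The sketch below uses the prices and accuracies quoted above and treats every invalid output as wasted spend. The function names are illustrative, and `TOKENS_PER_REQUEST = 1000` is an assumption, since the report does not give a request size.

```python
import math
from typing import Optional

TOKENS_PER_REQUEST = 1000  # assumption: not stated in the report


def cost_per_valid_output(price_per_1k_tokens: float, accuracy: float) -> float:
    """Expected spend per schema-valid output, counting invalid outputs as waste."""
    return price_per_1k_tokens * TOKENS_PER_REQUEST / 1000 / accuracy


def break_even_requests(fixed_cost: float, baseline: float, cheaper: float) -> Optional[int]:
    """Requests needed before per-request savings cover a one-time fixed cost."""
    saving = baseline - cheaper
    if saving <= 0:
        return None  # the "cheaper" option never pays back
    return math.ceil(fixed_cost / saving)


def implied_fixed_cost(requests: int, baseline: float, cheaper: float) -> float:
    """Fixed cost that would make `requests` the exact break-even point."""
    return requests * (baseline - cheaper)


prompted = cost_per_valid_output(0.04, 0.95)    # GPT-4o prompting
finetuned = cost_per_valid_output(0.003, 0.98)  # fine-tuned GPT-3.5-turbo

print(f"cost ratio: {prompted / finetuned:.1f}x")
print(f"break-even on a $50 training run: {break_even_requests(50.0, prompted, finetuned)} requests")
print(f"fixed cost implied by 500k break-even: ${implied_fixed_cost(500_000, prompted, finetuned):,.0f}")
```

Under these assumptions the ratio is about 13.8x, the $50 training fee alone is recovered after roughly 1,300 requests, and a 500k-request break-even corresponds to about $19.5k of fixed cost. That suggests the ~500k figure covers costs beyond the training run itself (for example, collecting and labeling the 10k examples), though the report does not itemize them.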
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:45:42.934348+00:00— report_created — created