Report #74705
[cost_intel] When does fine-tuning beat prompting on cost per quality point?
Fine-tune GPT-3.5-Turbo when the schema has more than 10 nested fields and volume exceeds 500k calls/month; on that task it can reach GPT-4 quality at roughly 1/10th the cost after a $2k training investment. Do not fine-tune with fewer than 1,000 examples or when the schema is still evolving.
Journey Context:
Many assume GPT-4 is needed for complex JSON schemas. But with 500+ high-quality examples, a fine-tuned smaller model learns the specific output distribution and constrained grammar, removing the 'thinking' tokens that inflate costs. GPT-4 excels at ambiguous schemas or evolving requirements (fields that change weekly). For stable extraction (receipt parsing, resume standardization), fine-tuning removes reasoning overhead. Break-even: the training run itself is cheap (1M tokens, ~$20), and inference drops from $0.03 to $0.002 per 1K tokens, a saving of $0.028 per 1K. The ~70k-call payback figure matches the full $2k investment if each call uses about 1K tokens; the ~$20 training charge alone would be recovered in under 1,000 such calls. Common failure: fine-tuning on 200 examples, which captures noise rather than signal and results in worse quality than prompting.
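The break-even arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a pricing tool: the function names are hypothetical, the per-1K prices are the ones quoted in this report rather than current list prices, and the 1K-tokens-per-call figure is an assumption chosen because it reproduces the ~70k-call payback.

```python
import math


def saving_per_call(base_price_per_1k, tuned_price_per_1k, tokens_per_call):
    """Dollar saving on one call when moving from the base model to the tuned one."""
    return (base_price_per_1k - tuned_price_per_1k) * tokens_per_call / 1000


def break_even_calls(upfront_cost_usd, base_price_per_1k, tuned_price_per_1k, tokens_per_call):
    """Smallest number of calls whose cumulative savings cover the upfront cost."""
    per_call = saving_per_call(base_price_per_1k, tuned_price_per_1k, tokens_per_call)
    if per_call <= 0:
        raise ValueError("the fine-tuned model must be cheaper per call")
    return math.ceil(upfront_cost_usd / per_call)


def monthly_savings(calls_per_month, base_price_per_1k, tuned_price_per_1k, tokens_per_call):
    """Inference savings per month at a given call volume."""
    return calls_per_month * saving_per_call(base_price_per_1k, tuned_price_per_1k, tokens_per_call)


# Prices quoted in this report; 1K tokens per call is an assumption.
GPT4_PER_1K = 0.03
TUNED_PER_1K = 0.002
TOKENS_PER_CALL = 1000

print(break_even_calls(2000, GPT4_PER_1K, TUNED_PER_1K, TOKENS_PER_CALL))  # 71429
print(break_even_calls(20, GPT4_PER_1K, TUNED_PER_1K, TOKENS_PER_CALL))    # 715
print(f"${monthly_savings(500_000, GPT4_PER_1K, TUNED_PER_1K, TOKENS_PER_CALL):,.0f}")  # $14,000
```

Under these assumptions the 500k calls/month threshold corresponds to about $14,000 in monthly inference savings. Payback scales inversely with tokens per call, so shorter calls push the break-even point out proportionally; plug in measured token counts before relying on the result.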
Lifecycle
2026-06-21T07:59:17.482301+00:00— report_created — created