Report #25388
[cost\_intel] When does fine-tuning beat few-shot prompting for structured extraction?
Fine-tune only when you have more than 500 training examples and the output schema is either nested more than 5 levels deep or has more than 20 fields; below these thresholds, 5-shot prompting with Claude 3.5 Sonnet or GPT-4o matches fine-tuned smaller models on accuracy at a lower total cost.
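The rule of thumb above can be written down as a small check. The following is a minimal sketch, not a definitive implementation: the three thresholds come from this report, while the helper names `nesting_depth`, `leaf_field_count` and `recommend`, and the way depth and fields are counted from a sample output, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Thresholds taken from the report; everything else here is illustrative.
MIN_EXAMPLES = 500
MAX_DEPTH = 5
MAX_FIELDS = 20


def nesting_depth(value: Any) -> int:
    """Count nested object levels. Lists are traversed but add no level (an assumption)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((nesting_depth(v) for v in value), default=0)
    return 0


def leaf_field_count(value: Any) -> int:
    """Count scalar fields. A list is represented by its first element (an assumption)."""
    if isinstance(value, dict):
        return sum(leaf_field_count(v) for v in value.values())
    if isinstance(value, list):
        return leaf_field_count(value[0]) if value else 1
    return 1


@dataclass
class Recommendation:
    fine_tune: bool
    depth: int
    fields: int


def recommend(example_count: int, sample_output: dict) -> Recommendation:
    """Apply the report's rule: enough examples AND (deep schema OR wide schema)."""
    depth = nesting_depth(sample_output)
    fields = leaf_field_count(sample_output)
    complex_schema = depth > MAX_DEPTH or fields > MAX_FIELDS
    enough_examples = example_count > MIN_EXAMPLES
    return Recommendation(enough_examples and complex_schema, depth, fields)


if __name__ == "__main__":
    # Hypothetical flat invoice schema: many examples, but shallow and narrow.
    flat_invoice = {"invoice": {"id": "A1", "total": 10.0}, "vendor": "Acme"}
    print(recommend(1200, flat_invoice))
```

Measuring depth and field count from one representative output is a shortcut; a real schema definition, where available, would be a more reliable source for both numbers.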
Journey Context:
Teams often rush to fine-tune for 'consistency' in JSON extraction, assuming it will cut costs and improve reliability. However, modern frontier models with JSON mode \(constrained decoding\) and 5-shot examples achieve >95% accuracy on flat schemas \(<10 fields\) without fine-tuning. Fine-tuning needs 500\+ examples to generalize well \(fewer leads to overfitting\), costs $50-300 in compute, and creates maintenance debt. It wins only when the task requires learning implicit domain conventions \(e.g., medical coding schemes, legal clause taxonomies\) that cannot be described in a prompt, or when the schema is so deeply nested \(>5 levels\) that the few-shot examples no longer fit in the context window.
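How the total-cost comparison plays out depends on request volume, so a rough break-even estimate can help. The sketch below is hedged accordingly: only the $50-300 fine-tuning compute range comes from this report, while the token counts and per-million-token prices are placeholder assumptions to be replaced with your own measurements and your provider's current pricing. Maintenance debt is not modeled.

```python
import math
from typing import Optional


def per_request_cost(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one request, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def break_even_requests(fine_tune_upfront_usd: float,
                        few_shot_cost: float,
                        fine_tuned_cost: float) -> Optional[int]:
    """Requests needed before the upfront fine-tuning spend is recovered.

    Returns None when the fine-tuned model is not cheaper per request,
    in which case fine-tuning never pays back on cost alone.
    """
    saving = few_shot_cost - fine_tuned_cost
    if saving <= 0:
        return None
    return math.ceil(fine_tune_upfront_usd / saving)


if __name__ == "__main__":
    # Placeholder assumption: a 5-shot prompt on a frontier model.
    few_shot = per_request_cost(3000, 300, usd_per_m_input=3.0, usd_per_m_output=15.0)
    # Placeholder assumption: a short prompt on a fine-tuned smaller model.
    fine_tuned = per_request_cost(600, 300, usd_per_m_input=0.30, usd_per_m_output=1.20)
    # Upper end of the report's $50-300 compute range.
    print(break_even_requests(300.0, few_shot, fine_tuned))
```

If the break-even count is far above your expected lifetime request volume, the few-shot route stays cheaper overall, which is the situation the report describes for flat schemas.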
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T21:00:58.718373+00:00— report_created — created