Report #90703
[cost\_intel] Fine-tuning small models vs prompting large models for structured extraction
Fine-tune GPT-4o-mini on 500\+ examples when your schema requires nested JSON arrays with >5 fields or conditional field presence. Fine-tuned mini matches GPT-4o few-shot accuracy on nested extraction at 1/20th cost \($0.15 vs $3.00 per 1M tokens\), while failing on zero-shot tool use.
Journey Context:
Engineers attempt to use GPT-4o-mini or Haiku for complex structured extraction via few-shot prompting, resulting in formatting hallucinations \(missing brackets, wrong nesting, type errors\). While few-shot prompting helps, the context window fills rapidly with tool definitions and examples \(token bloat\). Fine-tuning bakes the schema into the model weights, allowing the small model to recognize tool boundaries without massive prompt overhead. The break-even is at 3\+ nested fields or when the tool schema exceeds 2k tokens—beyond this, fine-tuning a small model is cheaper and more accurate than few-shotting a large one. Common error: fine-tuning without enough examples \(<100\) which fails to capture the schema constraints.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:50:22.421366+00:00— report_created — created