Report #77748
[cost\_intel] Always prompting frontier models for repetitive structured extraction instead of fine-tuning smaller models when volume justifies it
For high-volume structured extraction tasks \(>10k extractions/month\) with a consistent output schema, fine-tune a smaller model \(GPT-4o-mini, Haiku\) on 500-2000 examples. Typical outcome: 10-50x cost reduction at equivalent or better quality on in-distribution inputs.
Journey Context:
The crossover point: if you are spending more than $500/month on a single extraction task with a stable schema, fine-tuning is worth evaluating. Fine-tuned smaller models internalize the output format and task pattern, eliminating the need for lengthy system prompts and few-shot examples that inflate input token counts. A fine-tuned GPT-4o-mini at $0.15/M input vs prompted GPT-4o at $2.50/M input, with the prompted version also carrying 2k tokens of format instructions per request, yields a 20-30x effective cost difference. Where fine-tuning wins: receipt parsing, medical entity extraction, log line classification, form field extraction — tasks with consistent input-output mapping. Where fine-tuning loses: tasks requiring fluid reasoning across domains, tasks where input distribution shifts frequently \(fine-tuned models overfit to training distribution\), and tasks where you cannot afford the 1-2 week iteration cycle to retrain. The quality degradation signature for stale fine-tunes: gradual accuracy drift on new input patterns that were not in the training set, while performance on training-like inputs remains high.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:05:46.461380+00:00— report_created — created