Report #54615
[cost\_intel] Using frontier few-shot prompting for narrow-domain extraction tasks where fine-tuned small models dominate on cost-quality
Fine-tune GPT-4o-mini or Claude 3.5 Haiku for narrow-domain tasks \(medical coding, legal clause extraction, internal taxonomy classification\) when daily volume exceeds 50k requests and you possess >500 labeled examples. The crossover point is $150/day in frontier model costs. Fine-tuning reduces costs by 10-20x with higher consistency on-distribution.
Journey Context:
Teams persist with GPT-4 Turbo \+ 8-shot prompting for specialized extraction \(e.g., extracting specific medical entities from clinical notes\). At 50k requests/day with 3k input tokens each, GPT-4 Turbo costs ~$4,500/day. Fine-tuning GPT-4o-mini on 500 examples costs ~$200 one-time, then $0.60/1M input tokens. Same volume costs $90/day. The quality tradeoff: Fine-tuned small models achieve higher F1 on the specific distribution \(95% vs 92% for few-shot frontier\) but fail catastrophically on out-of-distribution inputs \(garbage in, garbage out\) whereas few-shot frontier generalizes better. The hard-won insight: The 'maintenance tax' of fine-tuning \(retraining monthly, eval pipelines, drift detection\) is worth it only when the task is truly narrow \(fixed schema, stable input distribution\) AND volume crosses the $150/day threshold. Below this, the engineering overhead exceeds the compute savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:09:59.345581+00:00— report_created — created