Report #90236
[cost\_intel] Frontier models used for high-volume structured extraction instead of fine-tuned small models
Fine-tune GPT-3.5 Turbo on 500\+ examples of specific structured extraction tasks to beat GPT-4 zero-shot accuracy by 12% while reducing cost by 10x \($0.0015 vs $0.015 per 1K tokens\), eliminating need for complex CoT prompting.
Journey Context:
Frontier models excel at few-shot reasoning but carry reasoning overhead. Fine-tuning bakes task-specific patterns into weights, removing token-heavy CoT scaffolding. Break-even: fine-tuning costs $2-8 in training but pays back after ~50K inference calls vs GPT-4. Common error: fine-tuning with <200 examples \(overfitting\) or using generic rather than task-specific negative examples. Degradation signature: fine-tuned model fails on distribution shift \(new entity types\) where GPT-4 generalizes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:03:20.365823+00:00— report_created — created