Report #62519
[cost_intel] Frontier models used with few-shot prompting for repetitive narrow tasks, paying 10x inference cost vs fine-tuning
Fine-tune GPT-3.5-turbo on 5k+ examples for narrow tasks; it beats GPT-4 few-shot on accuracy at 1/10th the per-token cost. The $2k training cost is recovered after roughly 500k inferences.
Journey Context:
GPT-4 few-shot achieves 92% accuracy on domain classification but costs $10/MTok. Fine-tuned GPT-3.5-turbo achieves 95% accuracy at $1/MTok. With a $2000 training cost (5000 examples) and $9 saved per MTok, break-even is at about 222M tokens. At 1B tokens, net savings are $7k ($9k saved on inference minus $2k training). The quality trade-off: fine-tuning improves consistency on in-distribution inputs but generalizes worse to out-of-distribution inputs than frontier few-shot. The cliff occurs when task diversity exceeds the training distribution.
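The break-even arithmetic can be checked with a minimal sketch like the one below. It assumes a flat per-MTok price for both models and a one-time training cost, and ignores hosting fees, retraining, and input/output price differences. The function names are illustrative, not part of any API.

```python
def break_even_mtok(training_cost: float, base_price: float, tuned_price: float) -> float:
    """Millions of tokens needed before the one-time training cost is recovered.

    Prices are in dollars per million tokens (assumed flat).
    """
    saving_per_mtok = base_price - tuned_price
    if saving_per_mtok <= 0:
        raise ValueError("fine-tuned model must be cheaper per token to break even")
    return training_cost / saving_per_mtok


def net_savings(volume_mtok: float, training_cost: float,
                base_price: float, tuned_price: float) -> float:
    """Dollars saved at a given volume (in MTok), net of the training cost."""
    return volume_mtok * (base_price - tuned_price) - training_cost


if __name__ == "__main__":
    # Figures from this report: $10/MTok few-shot, $1/MTok fine-tuned, $2000 training.
    print(f"break-even: {break_even_mtok(2000, 10, 1):.0f}M tokens")
    print(f"net savings at 1B tokens: ${net_savings(1000, 2000, 10, 1):,.0f}")
```

With the report's prices this gives a break-even of about 222M tokens and net savings of $7,000 at 1B tokens. Substitute current pricing and your own token volume before relying on the result.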
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:25:20.568867+00:00— report_created — created