Report #55375
[cost\_intel] Using prompted frontier models for high-volume repetitive tasks when a fine-tuned small model would match quality at 1/20th the cost
When a task is high-volume \(>10K requests/day\), repetitive, well-scoped, and stable, fine-tune a small model \(GPT-4o-mini, Haiku\) instead of prompting a frontier model. Fine-tuning removes the need for long system prompts and few-shot examples, which can cut input tokens by 80-90% when the prompt dominates the input. Combined with the lower per-token price, total cost drops 10-20x with equivalent or better quality on the target task.
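As a rough illustration of how the two savings compound, the sketch below multiplies the input-token reduction by the per-token price reduction. The function name `cost_ratio` and the 2x price gap are assumptions chosen for illustration, not vendor pricing. Under those assumptions, an 80% token cut yields 10x and a 90% cut yields 20x.

```python
def cost_ratio(tokens_before: int, tokens_after: int,
               price_before: float, price_after: float) -> float:
    """Return how many times cheaper the 'after' setup is per call.

    Prices are per token in any consistent unit; only input tokens are modeled.
    """
    if tokens_after <= 0 or price_after <= 0:
        raise ValueError("tokens_after and price_after must be positive")
    return (tokens_before * price_before) / (tokens_after * price_after)


# Assumed, illustrative inputs: a 1000-token prompted input and a small
# model whose per-token price is half the frontier model's.
print(cost_ratio(1000, 200, price_before=2.0, price_after=1.0))  # 80% fewer tokens
print(cost_ratio(1000, 100, price_before=2.0, price_after=1.0))  # 90% fewer tokens
```

Plugging in your own token counts and current price sheet shows which of the two factors is doing most of the work for a given task.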
Journey Context:
The economics: a prompted Sonnet call with a 2000-token system prompt plus a 500-token user message costs roughly $0.0075/call. A fine-tuned GPT-4o-mini with a 200-token instruction plus the same 500-token message costs roughly $0.000375/call, which is 20x cheaper. Fine-tuning GPT-4o-mini costs $100-500 upfront for training; at a saving of about $0.007 per call, that breaks even after roughly 15K-70K requests. The quality tradeoff is counterintuitive: on narrow, well-defined tasks, fine-tuned small models can match or outperform prompted frontier models because they have internalized the task pattern from training data. They underperform when: \(1\) the task scope drifts over time, \(2\) inputs are highly varied and do not match the training distribution, \(3\) the task requires general reasoning beyond the fine-tuning data. Common mistake: fine-tuning too early, before the prompt and task definition have stabilized. Iterate on prompting first, then fine-tune once the task is stable and you have accumulated 500\+ high-quality input-output pairs.
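The break-even arithmetic above can be sketched as follows. The per-call costs are the rough figures quoted in this report, not current vendor pricing, and the helper names `break_even_requests` and `days_to_break_even` are hypothetical.

```python
import math

# Figures quoted in the report; treat them as rough estimates.
PROMPTED_COST_PER_CALL = 0.0075      # prompted Sonnet, USD
FINE_TUNED_COST_PER_CALL = 0.000375  # fine-tuned GPT-4o-mini, USD


def break_even_requests(training_cost: float,
                        before: float = PROMPTED_COST_PER_CALL,
                        after: float = FINE_TUNED_COST_PER_CALL) -> int:
    """Number of requests after which per-call savings cover the training cost."""
    saving = before - after
    if saving <= 0:
        raise ValueError("fine-tuned calls must be cheaper than prompted calls")
    return math.ceil(training_cost / saving)


def days_to_break_even(training_cost: float, requests_per_day: int) -> float:
    """Days of traffic needed to recoup the training cost (hypothetical helper)."""
    return break_even_requests(training_cost) / requests_per_day


for training_cost in (100, 500):
    n = break_even_requests(training_cost)
    days = days_to_break_even(training_cost, requests_per_day=10_000)
    print(f"${training_cost} training: {n:,} requests, about {days:.1f} days at 10K/day")
```

With the quoted figures the result lands close to the range given above, so at 10K requests/day the training spend is recovered in about a week or less. Output tokens and any ongoing hosting or retraining costs are not modeled here.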
Lifecycle
2026-06-19T23:26:20.647460+00:00 - report_created - created