Report #85540
[cost_intel] Prompting frontier models for high-volume narrow tasks instead of fine-tuning small models
When you have >5K labeled examples and >10K daily requests on a single task type, fine-tune GPT-4o-mini or an equivalent small model. Cost per request drops 10-30x while quality matches or exceeds prompting GPT-4o on that narrow task. The key signal that fine-tuning will work: a short rubric plus 3 examples lets a human perform the task consistently at 95%+ accuracy.
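As a rough illustration, the thresholds above can be written as a simple gate. This is a minimal sketch, not a prescribed procedure: `TaskProfile` and `should_fine_tune` are hypothetical names, and requiring all three conditions together is one reading of the rule.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical summary of a single narrow task."""
    labeled_examples: int
    daily_requests: int
    # Accuracy (0.0-1.0) a human reaches given a short rubric plus 3 examples.
    human_accuracy_with_rubric: float


def should_fine_tune(task: TaskProfile) -> bool:
    """Apply the report's thresholds; this sketch requires all three to hold."""
    return (
        task.labeled_examples > 5_000
        and task.daily_requests > 10_000
        and task.human_accuracy_with_rubric >= 0.95
    )


high_volume = should_fine_tune(TaskProfile(8_000, 25_000, 0.97))
too_few_labels = should_fine_tune(TaskProfile(1_200, 25_000, 0.97))
print(high_volume, too_few_labels)
```

The gate is intentionally strict; a task that misses one threshold is better served by continued prompting until the gap is closed.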
Journey Context:
Fine-tuning eliminates the need for long system prompts and few-shot examples, reducing token count by 5-10x per request. A fine-tuned GPT-4o-mini at $0.60/$2.40 per M input/output tokens (fine-tuned inference pricing) with a 200-token prompt beats GPT-4o at $2.50/$10 per M tokens with a 2000-token prompt by roughly 12x on per-request cost, assuming about 200 output tokens per request; shorter outputs widen the gap and longer outputs narrow it. The fine-tuned model often matches or exceeds quality because it has internalized the exact output format and style. But fine-tuning fails when the task requires broad world knowledge beyond the training distribution, the output format isn't consistent across examples, or you can't curate 5K+ high-quality examples. The common mistake is fine-tuning too early, before the prompt has stabilized. Prompt first, freeze the prompt, then fine-tune once the task definition is stable.
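The per-request arithmetic can be checked with a short sketch. The prices and prompt sizes are the ones quoted in this report and should be verified against current pricing. The 200-token output length is an assumption, since the report does not state one; it is chosen because it reproduces the roughly 12x figure. The function and variable names are illustrative.

```python
def cost_per_request(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the USD cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed output length; the report does not specify one.
OUTPUT_TOKENS = 200

# Prices and prompt sizes as quoted in the report.
fine_tuned_mini = cost_per_request(200, OUTPUT_TOKENS, 0.60, 2.40)
prompted_4o = cost_per_request(2000, OUTPUT_TOKENS, 2.50, 10.00)

ratio = prompted_4o / fine_tuned_mini
# Daily difference at the report's 10K requests/day threshold.
daily_savings = (prompted_4o - fine_tuned_mini) * 10_000

print(f"fine-tuned mini: ${fine_tuned_mini:.5f} per request")
print(f"prompted GPT-4o: ${prompted_4o:.5f} per request")
print(f"ratio: {ratio:.1f}x, daily savings at 10K requests: ${daily_savings:.2f}")
```

Under these assumptions the ratio is sensitive to output length: with near-zero output it approaches about 42x because input cost dominates, and with very long outputs it falls toward about 4x, the ratio of the two output prices.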
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:10:00.300364+00:00— report_created — created