Report #56655
[cost\_intel] Prompting GPT-4-class models for high-volume narrow repetitive tasks
Fine-tune GPT-4o-mini or an equivalent small model when: \(1\) you have 500\+ high-quality input-output examples for a single task type, \(2\) you're making 50K\+ calls/month, \(3\) the task is narrow and format-consistent. Under those conditions, cost per quality point drops 10-30x compared with prompting a frontier model.
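The three conditions above can be written as a simple screening check. This is a minimal sketch, not part of the original report: `TaskProfile` and `should_consider_fine_tuning` are hypothetical names, the sketch assumes all three conditions must hold together, and "narrow and format-consistent" is left as a manual judgment flag.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical summary of one task type (illustrative, not a real API)."""
    labeled_examples: int         # high-quality input-output pairs for this task
    calls_per_month: int          # expected production volume
    narrow_and_consistent: bool   # manual judgment: single subtype, stable format


def should_consider_fine_tuning(
    profile: TaskProfile,
    min_examples: int = 500,      # threshold (1) from the report
    min_calls: int = 50_000,      # threshold (2) from the report
) -> bool:
    """Return True only if all three screening conditions hold (assumed conjunctive)."""
    return (
        profile.labeled_examples >= min_examples
        and profile.calls_per_month >= min_calls
        and profile.narrow_and_consistent
    )


if __name__ == "__main__":
    print(should_consider_fine_tuning(TaskProfile(1_000, 100_000, True)))
    print(should_consider_fine_tuning(TaskProfile(200, 100_000, True)))
```

A passing check only indicates the task is worth evaluating for fine-tuning; it says nothing about the quality the tuned model will reach.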
Journey Context:
Fine-tuning has real upfront costs: training compute \(~$50-100 for GPT-4o-mini on 1000 examples\), data preparation time, and evaluation infrastructure. The per-token price gap is large, though: fine-tuned GPT-4o-mini costs $0.15/M input \+ $0.60/M output tokens, versus $2.50/M \+ $10/M for GPT-4o. At 100K calls/month with 500 input \+ 200 output tokens each \(50M input and 20M output tokens per month\), GPT-4o costs ~$325/month and fine-tuned mini ~$19.50/month. Adding ~$75 of one-time training puts the first month at ~$95 and each later month at ~$20. Counting only the training charge, the ~$305/month saving pays it back in about a week.
The key failure modes: \(1\) fine-tuning on GPT-4 outputs that contain errors, which the small model faithfully reproduces; \(2\) task diversity: fine-tuning fails when the task has multiple distinct subtypes, and each subtype then needs its own model; \(3\) distribution shift: fine-tuned models degrade on inputs that differ significantly from the training data. Fine-tuning wins on cost per quality point when the task is narrow and volume is high; prompting wins when task diversity or low volume makes the training investment unjustified.
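The cost comparison above is plain arithmetic and can be reproduced with a short script. This is a sketch under the report's own figures: the per-million-token prices, the 500/200 token profile, and the ~$75 training cost are taken from the text as given, not checked against current pricing. The helper names `monthly_cost` and `payback_days` are illustrative, and the payback ignores data preparation and evaluation effort.

```python
def monthly_cost(calls, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Monthly spend in dollars for a fixed per-call token profile."""
    input_cost = calls * in_tokens / 1_000_000 * in_price_per_m
    output_cost = calls * out_tokens / 1_000_000 * out_price_per_m
    return input_cost + output_cost


def payback_days(training_cost, baseline_monthly, tuned_monthly, days_per_month=30):
    """Days until the monthly saving covers the one-time training cost."""
    savings = baseline_monthly - tuned_monthly
    if savings <= 0:
        return float("inf")  # fine-tuning never pays back at this volume
    return training_cost / savings * days_per_month


# Figures quoted in the report (assumed, not verified against a price list).
CALLS, IN_TOK, OUT_TOK = 100_000, 500, 200
gpt4o = monthly_cost(CALLS, IN_TOK, OUT_TOK, 2.50, 10.00)
mini = monthly_cost(CALLS, IN_TOK, OUT_TOK, 0.15, 0.60)
days = payback_days(75.0, gpt4o, mini)

if __name__ == "__main__":
    print(f"GPT-4o:          ${gpt4o:.2f}/month")   # $325.00/month
    print(f"Fine-tuned mini: ${mini:.2f}/month")    # $19.50/month
    print(f"Payback:         {days:.1f} days")      # 7.4 days
```

Rerunning with a lower call volume shows where the training investment stops paying off quickly: at 5K calls/month the same inputs give a payback of roughly five months.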
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:35:22.336636+00:00— report_created — created