Report #72147
[cost_intel] Over-prompting instead of fine-tuning on high-volume narrow tasks
When a single task type exceeds ~5K requests/day with a stable schema, benchmark a fine-tuned GPT-4o-mini or Claude Haiku against prompted GPT-4o or Sonnet. On narrow tasks, fine-tuning typically matches or exceeds prompted frontier-model quality at 10-30x lower per-request cost. The crossover: if you're spending more than $300-500/month on one repetitive task, fine-tuning pays for itself within 1-2 months.
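The payback claim above can be checked with simple arithmetic. The following is a minimal sketch, not a definitive model: the function name `payback_months` and the example inputs (a $400 monthly spend, a 10x cost reduction, and a $500 all-in upfront cost covering data preparation and training) are hypothetical values chosen for illustration.

```python
def payback_months(monthly_spend: float,
                   cost_reduction_factor: float,
                   upfront_cost: float) -> float:
    """Months until a one-time fine-tuning cost is recovered by savings.

    monthly_spend: current monthly spend on the prompted frontier model.
    cost_reduction_factor: how many times cheaper the fine-tuned model is
        per request (the report cites 10-30x).
    upfront_cost: one-time cost of data preparation plus training runs.
    """
    if cost_reduction_factor <= 1:
        raise ValueError("cost_reduction_factor must be greater than 1")
    monthly_spend_after = monthly_spend / cost_reduction_factor
    monthly_savings = monthly_spend - monthly_spend_after
    return upfront_cost / monthly_savings


# Hypothetical inputs, not measurements: replace with your own figures.
months = payback_months(monthly_spend=400.0,
                        cost_reduction_factor=10.0,
                        upfront_cost=500.0)
print(f"Payback period: {months:.1f} months")
```

The estimate ignores retraining, so it only holds while the task stays stable.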
Journey Context:
Fine-tuning has a high upfront cost (data preparation, plus training runs at $100-300) but transforms the cost-quality curve. A fine-tuned GPT-4o-mini ($0.15/M input, $0.60/M output at base-model list rates; fine-tuned inference is typically billed at a higher rate, so check current pricing) with 100 training examples often matches prompted GPT-4o ($2.50/M input, $10/M output) on classification, extraction, and formatting tasks. The key insight: fine-tuning bakes the prompt's instructions into the weights, so you don't pay for a 2K-token system prompt on every call. At 10K requests/day with a 2K-token prompt, that eliminates 20M input tokens/day of overhead. People avoid fine-tuning because of perceived complexity, but for stable high-volume tasks it is the economically correct choice. The failure mode is fine-tuning for tasks that drift: if your schema or requirements change monthly, the retraining cost erodes the savings.
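To make the prompt-overhead arithmetic concrete, here is a rough sketch of the daily cost comparison. The request volume, system prompt size, and per-million-token rates are the ones quoted in this report. The per-request payload size (300 tokens) and output size (100 tokens) are assumptions added for illustration, as are all names in the code. Swap in current provider pricing before relying on the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def daily_cost(requests_per_day: int, input_tokens: int,
               output_tokens: int, rates: Rates) -> float:
    """Daily spend in USD for a fixed per-request token profile."""
    input_cost = requests_per_day * input_tokens / 1_000_000 * rates.input_per_m
    output_cost = requests_per_day * output_tokens / 1_000_000 * rates.output_per_m
    return input_cost + output_cost


REQUESTS_PER_DAY = 10_000        # from the report
SYSTEM_PROMPT_TOKENS = 2_000     # from the report
PAYLOAD_TOKENS = 300             # assumption: per-request user input
OUTPUT_TOKENS = 100              # assumption: per-request completion

PROMPTED_GPT_4O = Rates(input_per_m=2.50, output_per_m=10.00)    # as quoted
FINE_TUNED_MINI = Rates(input_per_m=0.15, output_per_m=0.60)     # as quoted

# Tokens spent per day just resending the system prompt.
overhead_tokens_per_day = REQUESTS_PER_DAY * SYSTEM_PROMPT_TOKENS

# The prompted model pays for system prompt + payload on every call;
# the fine-tuned model pays for the payload only.
prompted = daily_cost(REQUESTS_PER_DAY,
                      SYSTEM_PROMPT_TOKENS + PAYLOAD_TOKENS,
                      OUTPUT_TOKENS, PROMPTED_GPT_4O)
fine_tuned = daily_cost(REQUESTS_PER_DAY, PAYLOAD_TOKENS,
                        OUTPUT_TOKENS, FINE_TUNED_MINI)

print(f"Prompt overhead: {overhead_tokens_per_day:,} tokens/day")
print(f"Prompted: ${prompted:.2f}/day, fine-tuned: ${fine_tuned:.2f}/day")
```

The ratio between the two figures is sensitive to the assumed payload and output sizes, so measure your own token profile before comparing it with the 10-30x range cited above.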
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:40:52.691318+00:00— report_created — created