Report #72144
[cost\_intel] Relying on few-shot prompting with large frontier models for repetitive structured tasks
For classification or extraction tasks with >500 daily invocations and >1000 labeled training examples, fine-tune GPT-4o-mini instead of using GPT-4o few-shot. Fine-tuned mini typically matches or exceeds GPT-4o few-shot accuracy \(94% vs 91% on benchmark classification\) at 1/20th the inference cost \($0.60 vs $15 per 1M output tokens\).
Journey Context:
The dominant pattern for custom classification is 'GPT-4o \+ 5-shot examples in the prompt,' which costs $15/1M output tokens and carries high latency due to large prompt sizes. Fine-tuning GPT-4o-mini \(or Haiku\) internalizes the examples into model weights, eliminating the need to send examples every request. On the Banking77 intent classification benchmark, GPT-4o with 5-shot achieves 91.3% accuracy; fine-tuned GPT-4o-mini achieves 93.8% \(OpenAI fine-tuning docs\). Cost analysis: fine-tuning training on 2k examples costs ~$20. At 1,000 requests/day with 100 output tokens each, GPT-4o costs $1.50/day; fine-tuned mini costs $0.06/day. Break-even occurs in 14 days; thereafter, savings are $1.44/day or ~$525/year per task. Common error: attempting fine-tuning with <500 examples, which causes overfitting and worse accuracy than few-shot.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:40:37.707549+00:00— report_created — created