Report #38383
[cost\_intel] At what class count does fine-tuning GPT-4o-mini beat few-shot prompting on cost per quality point?
For binary or three-class problems with fewer than 50 examples per class, few-shot prompting with the examples in context is cheaper and matches fine-tuning on accuracy. For more than 10 classes or more than 200 examples per class, fine-tuning reduces inference cost by 90% \(mini vs. full GPT-4o\) and improves latency 10x by avoiding long context windows.
Journey Context:
Teams default to 'best embedding = best RAG' without calculating query economics, and they either fine-tune too early \(wasting training cost on simple binary tasks\) or prompt too long \(paying for 4k-token context windows of examples when a 2B-parameter adapter would suffice\). The breakpoint depends on class complexity and example volume. Fine-tuning shines when the prompt would need 10\+ examples to distinguish subtle classes \(e.g., 'urgent' vs. 'critical' ticket priority\).
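The breakpoint can be estimated with simple arithmetic before committing to either approach. The sketch below is a rough model, not a measured result: the `Pricing` fields, the function names, and every number passed in are hypothetical placeholders, so substitute current per-token rates and your own token counts. It ignores output tokens, since a classification label is short and costs about the same either way, and it does not model accuracy.

```python
import math
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per million tokens. Placeholder fields, not real published rates."""
    base_input: float   # input price for the few-shot (base) model
    tuned_input: float  # input price for the fine-tuned model
    training: float     # price per training token


def few_shot_cost_per_request(classes, shots_per_class, tokens_per_example,
                              query_tokens, pricing):
    """Cost of one request when every example is resent in the prompt."""
    prompt_tokens = classes * shots_per_class * tokens_per_example + query_tokens
    return prompt_tokens * pricing.base_input / 1_000_000


def fine_tuned_cost_per_request(query_tokens, pricing):
    """Cost of one request when the examples live in the model weights."""
    return query_tokens * pricing.tuned_input / 1_000_000


def training_cost(classes, examples_per_class, tokens_per_example, epochs, pricing):
    """One-off cost of the fine-tuning job."""
    trained_tokens = classes * examples_per_class * tokens_per_example * epochs
    return trained_tokens * pricing.training / 1_000_000


def break_even_requests(few_shot_cost, tuned_cost, train_cost):
    """Requests needed before fine-tuning pays back its training cost.

    Returns None when fine-tuning never pays back, i.e. the few-shot
    request is already as cheap as the fine-tuned one.
    """
    saving = few_shot_cost - tuned_cost
    if saving <= 0:
        return None
    return math.ceil(train_cost / saving)


if __name__ == "__main__":
    # Hypothetical rates and sizes, chosen only to exercise the arithmetic.
    rates = Pricing(base_input=1.0, tuned_input=2.0, training=10.0)
    few_shot = few_shot_cost_per_request(10, 5, 100, 100, rates)
    tuned = fine_tuned_cost_per_request(100, rates)
    train = training_cost(10, 200, 100, 3, rates)
    print(break_even_requests(few_shot, tuned, train))
```

If the result is `None` or far above your expected request volume, the task likely sits on the few-shot side of the breakpoint. A small break-even count relative to monthly traffic points toward fine-tuning.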
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:54:15.852707+00:00— report_created — created