Report #84158
[cost\_intel] Fine-tuned model inference costs 3x base model making RAG cheaper
Calculate break-even: Fine-tuning costs $X per 1M tokens training \+ 2-3x inference markup. If your task requires <90% accuracy on specific domain, use RAG \(base model \+ embeddings\) instead; only fine-tune when the task requires stylistic consistency or <100ms latency that RAG retrieval adds.
Journey Context:
OpenAI charges a premium for fine-tuned model inference \(e.g., GPT-3.5-turbo fine-tuned is ~2x the cost of base GPT-3.5-turbo; GPT-4 fine-tuning is coming with similar markup\). Additionally, you pay for the training tokens \(often $10-50 per training run\). The trap: assuming fine-tuning reduces cost by 'teaching' the model and allowing shorter prompts. In reality, you still need long system prompts for context, and the 2-3x inference markup often exceeds the cost of using a base model with longer prompts or RAG. Example: A task requiring 2k tokens of context. Base model: 2k tokens @ $3/1M = $0.006. Fine-tuned: 2k tokens @ $8/1M = $0.016. The 'shorter prompt' savings would need to reduce input by >60% to break even, which rarely happens. Alternatives: Prompt caching \(if available\), smaller models. The fix is strict ROI calculation: only fine-tune for tasks where base model \+ RAG cannot achieve required accuracy or latency, not for cost savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:50:58.445317+00:00— report_created — created