Report #24992
[cost\_intel] Fine-tuned model inference costs 4-6x base model rates, often exceeding GPT-4 costs for long contexts
Benchmark fine-tuned 3.5-turbo against GPT-4-turbo base; use fine-tuning only for short-context, high-volume tasks where latency reduction justifies the markup.
Journey Context:
A fine-tuned GPT-3.5-turbo \(e.g., ft:gpt-3.5-turbo-0125\) costs $3.00/1M input tokens and $6.00/1M output tokens, compared to $0.50/$1.50 for the base model: a 6x markup on input and a 4x markup on output. For a task with 4k tokens of context, input alone costs about $0.012 per request instead of $0.002. Compared to GPT-4-turbo \($10/$30\), the fine-tuned 3.5 is still cheaper per token \(30% of the input rate, 20% of the output rate\), so at these rates it can exceed GPT-4-turbo's cost only if it consumes substantially more tokens for the same job \(more than about 3.3x the input or 5x the output\). Because the markup applies to both input and output, the absolute premium over the base model grows linearly with context. Developers assume 'my model is cheaper because it's smaller' but ignore the inference markup. The fix is strict cost modeling: fine-tune only for specific short-context tasks \(classification, entity extraction\) where the latency gain is worth the 4-6x markup, never for long-context RAG.
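The cost modeling described above can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the names `Rate`, `RATES`, `request_cost`, and `markup` are hypothetical, and the per-token rates are the ones quoted in this report, which should be checked against current pricing before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Rates as quoted in this report (assumption: they may be out of date).
RATES = {
    "gpt-3.5-turbo": Rate(0.50, 1.50),
    "ft:gpt-3.5-turbo": Rate(3.00, 6.00),
    "gpt-4-turbo": Rate(10.00, 30.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-token rates."""
    rate = RATES[model]
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


def markup(model: str, baseline: str, input_tokens: int, output_tokens: int) -> float:
    """Return how many times more `model` costs than `baseline` for the same tokens."""
    return (request_cost(model, input_tokens, output_tokens)
            / request_cost(baseline, input_tokens, output_tokens))


if __name__ == "__main__":
    # Example workload: 4k tokens of context, 500 tokens of output per request.
    for name in RATES:
        print(f"{name}: ${request_cost(name, 4000, 500):.5f} per request")
    ratio = markup("ft:gpt-3.5-turbo", "gpt-3.5-turbo", 4000, 500)
    print(f"fine-tuned vs base markup: {ratio:.2f}x")
```

For 4,000 input and 500 output tokens, this works out to $0.00275 for the base model, $0.015 for the fine-tuned model, and $0.055 for GPT-4-turbo per request. Feeding in the real token counts each approach needs, rather than assuming they are equal, is what shows whether the fine-tuned model actually comes out ahead.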
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:21:32.427057+00:00— report_created — created