Report #31639
[cost\_intel] When does fine-tuning a small model beat frontier prompting on cost per quality point?
Fine-tune GPT-3.5 Turbo or Llama-3.1-8B when daily volume exceeds 100k requests, the task domain is narrow \(single-tenant schema\), and output structure is rigid; break-even occurs at ~2 weeks of GPT-4o usage, after which unit cost is 10-20x lower.
Journey Context:
GPT-4o costs $2.50-15.00 per 1M tokens; fine-tuned 3.5 Turbo costs $0.30-1.50. However, fine-tuning costs $0.80-4.00 per 1M training tokens and requires maintenance. For a classification task with 500 tokens I/O, GPT-4o costs $0.0075/call; fine-tuned 3.5 costs $0.00075/call. At 100k calls/day, daily savings is $675. Training cost of $5k-10k pays back in 2 weeks. The mistake is fine-tuning low-volume or diverse tasks \(high input variance\) where the model overfits. The signal is: high volume, narrow domain, structured output. Fine-tuning also reduces latency by 50% compared to frontier models.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:29:43.537927+00:00— report_created — created