Report #52213
[cost\_intel] Fine-tuned model inference baseline cost exceeds GPT-4 for low-QPS workloads due to minimum instance billing
For <1000 requests/day, use serverless APIs \(Together, Fireworks\) or GPT-4; only self-host at >10k requests/day, where per-token savings overcome the instance baseline
Journey Context:
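A fine-tuned 7B model costs $0.20/1M tokens vs GPT-4 at $30/1M, which is 150x cheaper per token. However, self-hosting on AWS SageMaker requires at least a g5.xlarge instance \($0.40/hour, about $9.60/day or roughly $300/month\), billed even at zero traffic. At 100 requests/day of 1k tokens each \(100k tokens/day\), you pay $0.02/day in token costs but about $10/day in compute, whereas the same traffic on GPT-4 costs $3/day. On these figures, self-hosting overtakes GPT-4 at roughly 320 requests/day \($9.60 divided by a saving of about $0.0298 per request\). The break-even of around 10k requests/day quoted here presumably refers to the crossover against a serverless API, where the per-token gap is far smaller. The trap: comparing per-token prices without accounting for the infrastructure baseline. Serverless providers \(Together, Fireworks\) charge per token with zero baseline, but at higher per-token rates than self-hosting. At low request volumes they are cheaper than self-hosting overall, yet they cost more per token than a self-hosted model, so you must calculate the crossover for your own volume.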
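The crossover can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, the prices are the figures quoted in this report rather than current list prices, and it assumes every request uses 1k tokens and that the instance is billed 24 hours a day.

```python
from typing import Optional

HOURS_PER_DAY = 24
TOKENS_PER_MILLION = 1_000_000

# Figures quoted in the report (assumptions; replace with current pricing).
INSTANCE_USD_PER_HOUR = 0.40          # always-on instance baseline
SELF_HOSTED_USD_PER_MILLION = 0.20    # fine-tuned 7B per-token cost
GPT4_USD_PER_MILLION = 30.0
TOKENS_PER_REQUEST = 1_000


def per_token_api_daily_cost(requests_per_day: float,
                             usd_per_million: float,
                             tokens_per_request: int = TOKENS_PER_REQUEST) -> float:
    """Daily cost of a pay-per-token API with no baseline charge."""
    tokens = requests_per_day * tokens_per_request
    return tokens / TOKENS_PER_MILLION * usd_per_million


def self_hosted_daily_cost(requests_per_day: float,
                           instance_usd_per_hour: float = INSTANCE_USD_PER_HOUR,
                           usd_per_million: float = SELF_HOSTED_USD_PER_MILLION,
                           tokens_per_request: int = TOKENS_PER_REQUEST) -> float:
    """Daily cost of self-hosting: fixed instance baseline plus token cost."""
    baseline = instance_usd_per_hour * HOURS_PER_DAY
    return baseline + per_token_api_daily_cost(
        requests_per_day, usd_per_million, tokens_per_request)


def break_even_requests_per_day(api_usd_per_million: float,
                                instance_usd_per_hour: float = INSTANCE_USD_PER_HOUR,
                                self_usd_per_million: float = SELF_HOSTED_USD_PER_MILLION,
                                tokens_per_request: int = TOKENS_PER_REQUEST) -> Optional[float]:
    """Requests/day above which self-hosting is cheaper than the API.

    Returns None when the API is not more expensive per token, because the
    instance baseline can then never be recovered.
    """
    baseline = instance_usd_per_hour * HOURS_PER_DAY
    saving_per_request = ((api_usd_per_million - self_usd_per_million)
                          * tokens_per_request / TOKENS_PER_MILLION)
    if saving_per_request <= 0:
        return None
    return baseline / saving_per_request


if __name__ == "__main__":
    for rpd in (100, 1_000, 10_000):
        hosted = self_hosted_daily_cost(rpd)
        gpt4 = per_token_api_daily_cost(rpd, GPT4_USD_PER_MILLION)
        print(f"{rpd:>6} req/day: self-hosted ${hosted:.2f}, GPT-4 ${gpt4:.2f}")
    print("break-even vs GPT-4:", break_even_requests_per_day(GPT4_USD_PER_MILLION))
```

Passing a serverless provider's published per-token rate as `api_usd_per_million` gives the crossover against that provider. No serverless rate is hard-coded because this report does not state one.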
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:08:07.932556+00:00— report_created — created