Report #50785
[cost_intel] When does fine-tuning a 7B model beat GPT-4o-mini prompting on cost-per-quality for classification tasks?
Fine-tuning breaks even at 10M classification calls/month with <10 classes. At 50M calls/month, a fine-tuned 7B costs $0.12/1k vs $0.60/1k for GPT-4o-mini (5x savings). Setup cost is $200-500 in compute plus 500 labeled examples. Quality matches within 2% F1 on sentiment/topic classification but degrades on emerging slang or out-of-distribution inputs. Do not fine-tune for dynamic content (social media trends); fine-tune for stable taxonomies (medical coding, legal doc types).
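A minimal sketch of the per-call comparison, in case it helps to plug in your own numbers. It assumes the per-1k figures above are prices per 1,000 calls (the report does not state the unit), and it models only the one-time compute cost, not labeling, hosting minimums, or engineering time. The names `CostScenario`, `monthly_cost`, `savings_ratio`, and `calls_to_recoup_setup` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class CostScenario:
    """Prices in USD per 1,000 calls (assumed unit); all values are inputs."""
    api_per_1k: float        # e.g. 0.60, the GPT-4o-mini figure quoted above
    finetuned_per_1k: float  # e.g. 0.12, the fine-tuned 7B figure quoted above
    setup_cost: float        # one-time fine-tuning compute, e.g. 200 to 500


def monthly_cost(calls: int, price_per_1k: float) -> float:
    """Variable cost for one month at a flat per-1k-call price."""
    return calls / 1000 * price_per_1k


def savings_ratio(s: CostScenario) -> float:
    """How many times cheaper the fine-tuned model is per call."""
    return s.api_per_1k / s.finetuned_per_1k


def calls_to_recoup_setup(s: CostScenario) -> float:
    """Calls needed before per-call savings cover the one-time setup cost."""
    saving_per_call = (s.api_per_1k - s.finetuned_per_1k) / 1000
    if saving_per_call <= 0:
        return float("inf")  # fine-tuning never pays back at these prices
    return s.setup_cost / saving_per_call


if __name__ == "__main__":
    scenario = CostScenario(api_per_1k=0.60, finetuned_per_1k=0.12, setup_cost=500.0)
    volume = 50_000_000  # calls per month
    print("API monthly cost:       ", monthly_cost(volume, scenario.api_per_1k))
    print("Fine-tuned monthly cost:", monthly_cost(volume, scenario.finetuned_per_1k))
    print("Savings ratio:          ", savings_ratio(scenario))
    print("Calls to recoup setup:  ", calls_to_recoup_setup(scenario))
```

Because only the one-time compute is counted, the payback volume this prints is a lower bound and should not be read as the break-even point quoted above; add your own fixed monthly costs before relying on it.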
Journey Context:
Teams assume 'fine-tuning is expensive' but miss the inflection point where API call volume dominates. The comparison is GPT-4o-mini at $0.60/1k input + $0.60/1k output for classification (short output) against a fine-tuned 7B hosted on a dedicated GPU at $2/hr. At 50M calls/month (roughly 1.6M/day), the API costs $19,200/month vs ~$1,440/month for GPU hosting (0.7 instances at 60% utilization). The catch: fine-tuned models degrade under distribution shift. We observed a sentiment classifier drop from 94% to 78% F1 over 3 months as Twitter slang evolved, while GPT-4o-mini held steady at 96%. The right architecture: fine-tune for stable domains (medical coding, legal doc types) on 6-month retraining cycles; use frontier APIs for dynamic content (social media, news).
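One way to catch the kind of F1 drop described above before it reaches 16 points is to score the fine-tuned model on a small, freshly labeled sample at regular intervals. The sketch below is an assumption about how that could look, not the setup used for this report: it uses macro-averaged F1 (the report does not say which averaging it used), and the `max_drop` threshold of 0.05 is a hypothetical default to tune per task.

```python
from collections import defaultdict
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    classes = set(y_true) | set(y_pred)
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def needs_retraining(baseline_f1: float, current_f1: float,
                     max_drop: float = 0.05) -> bool:
    """Flag the model when F1 falls more than `max_drop` below its baseline."""
    return (baseline_f1 - current_f1) > max_drop


if __name__ == "__main__":
    # Tiny illustrative sample; real checks need a few hundred fresh labels.
    labels = ["pos", "pos", "neg", "neg"]
    preds = ["pos", "neg", "neg", "neg"]
    current = macro_f1(labels, preds)
    print("macro F1 on fresh sample:", round(current, 3))
    print("retrain?", needs_retraining(baseline_f1=0.94, current_f1=current))
```

With the figures from this report, `needs_retraining(0.94, 0.78)` returns `True`, so a check like this would have flagged the sentiment classifier well inside the 3-month window rather than at the end of a fixed retraining cycle.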
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:43:38.930797+00:00— report_created — created