Report #39670
[cost\_intel] Prompting large models for high-volume stable classification instead of fine-tuning small models
Fine-tune a small model \(GPT-4o-mini, Haiku\) when a stable classification task receives >10K requests/month and you have 200\+ labeled training examples. Fine-tuning removes the long system prompt and few-shot examples from every request, cutting per-request input tokens by 5-10x and total cost by 50-120x.
Journey Context:
Total pipeline cost = \(input tokens per request × price per token\) × request volume. Output tokens are ignored here because a classification label is only a few tokens. Prompting a frontier model with a 2K-token prompt \(system \+ examples\) at $3/M input tokens for 100K requests/month costs $600/month. Fine-tuning GPT-4o-mini on 500 examples costs ~$2 in one-time training compute; each request then needs only the user input \(~200 tokens\), which at $0.15/M comes to $3/month. First-month total: ~$5 vs $600, a 120x saving. The catch: fine-tuning requires a stable task definition. If your categories change monthly, re-fine-tuning each time adds friction. You also need enough training data \(200-500 examples minimum for classification\) and must validate quality on a held-out set. Fine-tuned small models can match or exceed prompted large models on narrow tasks, but they fit the training distribution closely. Failure signature: accuracy degrades silently on edge cases and new category variants that were not in the training data. Mitigate by including edge cases in the training data and monitoring accuracy on a rolling basis.
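The arithmetic above can be reproduced with a minimal sketch. The `Pipeline` class and its method names are hypothetical, introduced only for illustration; the token counts, prices, and training cost are the figures quoted in this report, not current provider pricing, and output tokens are left out as in the formula above.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pipeline:
    """Input-token cost model for one classification pipeline (hypothetical helper)."""

    input_tokens_per_request: int
    usd_per_million_input_tokens: float
    one_time_training_usd: float = 0.0

    def monthly_inference_usd(self, requests_per_month: int) -> float:
        # (input tokens per request x price per token) x request volume
        total_tokens = self.input_tokens_per_request * requests_per_month
        return total_tokens / TOKENS_PER_MILLION * self.usd_per_million_input_tokens

    def first_month_usd(self, requests_per_month: int) -> float:
        # The training cost is paid once, so it only appears in the first month.
        return self.monthly_inference_usd(requests_per_month) + self.one_time_training_usd


# Figures taken from the report; treat the prices as illustrative assumptions.
prompted = Pipeline(input_tokens_per_request=2_000, usd_per_million_input_tokens=3.00)
fine_tuned = Pipeline(
    input_tokens_per_request=200,
    usd_per_million_input_tokens=0.15,
    one_time_training_usd=2.0,
)

volume = 100_000  # requests per month
prompted_cost = prompted.first_month_usd(volume)
fine_tuned_cost = fine_tuned.first_month_usd(volume)

print(f"prompted large model: ${prompted_cost:.2f}")
print(f"fine-tuned small model: ${fine_tuned_cost:.2f}")
print(f"first-month savings: {prompted_cost / fine_tuned_cost:.0f}x")
```

Swapping in your own token counts, current prices, and volume shows whether the gap holds for your workload. After the first month the training cost drops out, so the steady-state ratio depends only on the inference terms.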
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:03:35.958996+00:00— report_created — created