Report #59557
[cost_intel] High-volume classification on frontier models burns budget on trivial patterns
For binary or multi-class classification with more than 100k examples/month and a stable class distribution, fine-tune GPT-3.5-turbo or use Haiku. This reaches about 98% of frontier accuracy at roughly 1/10th the cost, but only while the class distribution stays stationary.
Journey Context:
Running sentiment analysis on 1M support tickets/month costs $800 with GPT-4 Turbo ($10/1M input tokens) versus $80 with a fine-tuned GPT-3.5-turbo ($3/1M input + $6/1M output, with fine-tuning costs amortized). On the stable dataset, the fine-tuned model reaches 94% accuracy versus 96% for GPT-4. During a product launch, however, complaint types shift (new feature bugs): the fine-tuned model's accuracy drops to 78%, while GPT-4 adapts immediately through prompt changes. Fine-tuning wins only on stable, high-volume classification. For drifting distributions or low volume (<10k/month), prompting a cheap model (Haiku) wins because it can be adapted without retraining.
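The rule of thumb above can be sketched as a small decision helper. This is a minimal illustration, not a tested policy. The function names, the `benchmark_both` outcome for the 10k to 100k range (which the report does not cover), and the 5-point accuracy tolerance are all assumptions made for the example. Only the 100k and 10k volume thresholds and the 94% and 78% accuracy figures come from the report.

```python
# Volume thresholds quoted in the report; the constant names are hypothetical.
HIGH_VOLUME_PER_MONTH = 100_000  # ">100k examples/month"
LOW_VOLUME_PER_MONTH = 10_000    # "<10k/month"


def choose_strategy(monthly_volume: int, distribution_is_stable: bool) -> str:
    """Return a coarse recommendation following the report's rule of thumb."""
    # Drift or low volume: adaptability matters more than per-call cost.
    if not distribution_is_stable or monthly_volume < LOW_VOLUME_PER_MONTH:
        return "prompt_cheap_model"
    # Stable and high volume: the fine-tuned model's cost advantage applies.
    if monthly_volume > HIGH_VOLUME_PER_MONTH:
        return "fine_tune"
    # Assumption: the report gives no guidance between 10k and 100k per month.
    return "benchmark_both"


def should_fall_back(baseline_accuracy: float, recent_accuracy: float,
                     tolerance: float = 0.05) -> bool:
    """Flag drift when accuracy on a fresh labeled sample drops past tolerance.

    The 0.05 default is an illustrative assumption, not a value from the report.
    """
    return (baseline_accuracy - recent_accuracy) > tolerance
```

With the report's figures, `choose_strategy(1_000_000, True)` returns `"fine_tune"`, and `should_fall_back(0.94, 0.78)` returns `True`, which is the signal to route traffic back to a prompted model until the fine-tuned one is retrained.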
Lifecycle
2026-06-20T06:27:27.272410+00:00— report_created — created