Report #77425
[cost_intel] When does fine-tuning GPT-3.5 beat GPT-4 prompting on cost/quality for classification?
Fine-tune GPT-3.5-Turbo when you have more than 500 labeled examples per class, more than 10 classes, and inputs under 2k tokens. On stable label distributions this yields roughly 10x lower cost than GPT-4 prompting with comparable F1. Avoid it if the classes evolve weekly (distribution shift).
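The thresholds above can be written down as a simple pre-check. This is a minimal sketch, not a validated rule: `TaskProfile` and `should_fine_tune` are hypothetical names introduced here, and "2k tokens" is read as 2,000.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a classification workload."""
    min_examples_per_class: int   # labeled examples in the smallest class
    num_classes: int
    max_input_tokens: int         # longest input you expect to classify
    classes_change_weekly: bool   # True if the label set shifts week to week


def should_fine_tune(task: TaskProfile) -> bool:
    """Apply the report's rule of thumb; all cutoffs come from the report."""
    if task.classes_change_weekly:
        # Distribution shift: the report advises against fine-tuning here.
        return False
    return (
        task.min_examples_per_class > 500
        and task.num_classes > 10
        and task.max_input_tokens < 2_000  # assumption: "2k" means 2,000
    )


stable_task = TaskProfile(800, 25, 1_200, classes_change_weekly=False)
shifting_task = TaskProfile(800, 25, 1_200, classes_change_weekly=True)
print(should_fine_tune(stable_task), should_fine_tune(shifting_task))
```

Treat the result as a first filter only; a held-out F1 comparison on your own data is still needed before committing to training.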
Journey Context:
People often default to GPT-4 few-shot prompting for classification, which burns budget. Fine-tuning specializes the model to your label distribution and shortens the prompt (no need for 10-shot examples). Cost math: GPT-4 at $30/1M output tokens vs. fine-tuned (FT) GPT-3.5 at $7.50/1M, plus amortized training cost. Quality cliff: the FT model fails catastrophically on out-of-distribution examples (such as new product categories), whereas GPT-4 adapts via instructions. Use a hybrid: the FT model for known categories, with GPT-4 as the fallback for 'other'.
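The cost math and the hybrid fallback can be sketched as follows. This is an illustration under stated assumptions: the per-token prices are the report's figures applied as a single flat rate to all tokens, the token counts and the $100 training cost are made-up inputs, and `break_even_requests` and `classify_hybrid` are hypothetical helpers. The two classifier arguments stand in for real model calls, and the sketch assumes the FT model was trained to emit an 'other' label.

```python
import math
from typing import Callable, Optional


def cost_per_request(tokens: int, price_per_million: float) -> float:
    """Dollar cost of one request at a flat per-token rate (a simplification)."""
    return tokens * price_per_million / 1_000_000


def break_even_requests(
    training_cost: float,
    gpt4_tokens: int,
    ft_tokens: int,
    gpt4_price: float = 30.0,  # $/1M tokens, figure taken from the report
    ft_price: float = 7.50,    # $/1M tokens, figure taken from the report
) -> Optional[int]:
    """Requests needed before per-request savings repay the training cost."""
    saving = cost_per_request(gpt4_tokens, gpt4_price) - cost_per_request(
        ft_tokens, ft_price
    )
    if saving <= 0:
        return None  # fine-tuning never pays off at these rates
    return math.ceil(training_cost / saving)


def classify_hybrid(
    text: str,
    ft_classify: Callable[[str], str],
    gpt4_classify: Callable[[str], str],
    known_labels: set[str],
) -> str:
    """Use the FT model for known categories; send everything else to GPT-4."""
    label = ft_classify(text)
    if label in known_labels:
        return label
    # 'other' or any unexpected label: fall back to the instructed model.
    return gpt4_classify(text)


# Hypothetical inputs: a 1,500-token 10-shot GPT-4 prompt vs. a 300-token FT prompt.
print(break_even_requests(training_cost=100.0, gpt4_tokens=1_500, ft_tokens=300))
```

Because the fallback path pays GPT-4 rates, the real saving depends on how often inputs land in 'other'; tracking that rate also gives an early signal of distribution shift.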
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:33:27.745839+00:00— report_created — created