Report #51500
[cost_intel] When does fine-tuning GPT-3.5-turbo beat GPT-4o-mini for specialized tasks?
Fine-tune GPT-3.5-turbo on more than 500 high-quality examples when the task requires rigid adherence to a complex style guide (specific formatting, tone constraints). This reaches about 90% of GPT-4o-mini's quality at roughly one-fifth of the inference cost, but only when production inputs match the training distribution; out-of-distribution (OOD) inputs fail catastrophically.
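The cost ratio depends on per-token prices and on prompt length, and the report states neither. Below is a minimal sketch for checking the ratio on your own workload. The `Pricing` class, both helper functions, and every number are hypothetical placeholders, not real provider prices; substitute values from the current price sheet and your measured token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. Values must come from the provider's price sheet."""
    input_per_million: float
    output_per_million: float


def cost_per_request(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def cost_ratio(candidate_cost: float, baseline_cost: float) -> float:
    """Return candidate cost as a fraction of baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return candidate_cost / baseline_cost


# PLACEHOLDER prices and token counts, for illustration only.
fine_tuned_prices = Pricing(input_per_million=2.0, output_per_million=4.0)
baseline_prices = Pricing(input_per_million=1.0, output_per_million=2.0)

# Assumption: the fine-tuned model needs only a short prompt, while the
# baseline carries few-shot examples and the style guide in every request.
fine_tuned_cost = cost_per_request(fine_tuned_prices, input_tokens=200, output_tokens=100)
baseline_cost = cost_per_request(baseline_prices, input_tokens=1800, output_tokens=100)

print(f"fine-tuned / baseline cost ratio: {cost_ratio(fine_tuned_cost, baseline_cost):.2f}")
```

Run this with real prices before committing to a fine-tune: a higher per-token price for the fine-tuned model can cancel out the savings from a shorter prompt.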
Journey Context:
Teams often assume that a bigger model is better for every style task. However, fine-tuning a smaller model on a narrow distribution can hardcode patterns that a larger model struggles to reproduce consistently through few-shot prompting. The trap is dataset size: with fewer than 200 examples, the model overfits and performs worse than the base model. The OOD risk is real: a fine-tuned customer support bot trained on US customers hallucinates answers for UK customers because the style fine-tuning overrode the base knowledge.
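The two failure modes above can be guarded against before and after training. Below is a minimal sketch, assuming a simple vocabulary-coverage heuristic for OOD detection. The heuristic is a stand-in chosen for illustration, not the method used in this report. The 200 and 500 thresholds come from the report; the function names, model-name placeholders, and the 0.8 coverage cutoff are hypothetical.

```python
from __future__ import annotations

import re

MIN_EXAMPLES = 200     # report: fewer than this overfits, worse than the base model
TARGET_EXAMPLES = 500  # report: fine-tune with more than this


def dataset_verdict(n_examples: int) -> str:
    """Classify a training-set size against the report's thresholds."""
    if n_examples < MIN_EXAMPLES:
        return "too_small"
    if n_examples <= TARGET_EXAMPLES:
        return "borderline"
    return "ok"


def tokens(text: str) -> set[str]:
    """Lowercase word set; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def build_vocabulary(training_inputs: list[str]) -> set[str]:
    """Collect every word seen in the training inputs."""
    vocab: set[str] = set()
    for text in training_inputs:
        vocab |= tokens(text)
    return vocab


def in_distribution(text: str, vocab: set[str], min_coverage: float = 0.8) -> bool:
    """True when enough of the input's words appeared in training (assumed cutoff)."""
    words = tokens(text)
    if not words:
        return False
    return len(words & vocab) / len(words) >= min_coverage


def choose_model(text: str, vocab: set[str],
                 fine_tuned: str = "fine-tuned-model-placeholder",
                 fallback: str = "general-model-placeholder") -> str:
    """Route in-distribution inputs to the fine-tune, everything else to a fallback."""
    return fine_tuned if in_distribution(text, vocab) else fallback


# Toy training inputs from a US-only support queue (invented for illustration).
vocab = build_vocabulary([
    "Where is my order? It shipped to my zip code last week.",
    "Can I return my order for a refund?",
])
print(choose_model("Where is my refund?", vocab))
print(choose_model("Is VAT included for my postcode in Leeds?", vocab))
```

Word overlap is a weak signal. In practice an embedding-distance check or a small classifier would likely catch more cases, but the routing shape stays the same: inputs the fine-tune has not seen go to a general model.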
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:56:02.481587+00:00— report_created — created