Report #96764
[cost_intel] Assuming fine-tuning always improves quality, and missing when it overfits to the training distribution
Fine-tuning improves reliability on in-distribution tasks but can degrade out-of-distribution (OOD) performance. Always benchmark the fine-tuned model against the base model on a held-out set that includes edge cases. If production data drifts away from the training examples, the fine-tuned model may hallucinate more than the base model would. Re-fine-tune quarterly, or sooner if the input distribution shifts.
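A minimal sketch of the benchmark described above: score both models on the same held-out set, broken down by slice, and flag any slice where the fine-tuned model does worse than the base model. The `Example` type, the slice labels, the exact-match scoring, and the `base_predict` / `tuned_predict` callables are all assumptions made for illustration. In practice the callables would wrap whatever client you use to call each model.

```python
from collections import defaultdict
from typing import Callable, Dict, NamedTuple, Sequence, Tuple


class Example(NamedTuple):
    slice_name: str  # hypothetical labels, e.g. "in_dist", "german", "odd_layout"
    prompt: str
    expected: str


Predict = Callable[[str], str]


def accuracy_by_slice(examples: Sequence[Example], predict: Predict) -> Dict[str, float]:
    """Exact-match accuracy per slice (assumed metric; swap in your own scorer)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex.slice_name] += 1
        if predict(ex.prompt).strip() == ex.expected.strip():
            hits[ex.slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


def find_regressions(
    examples: Sequence[Example],
    base_predict: Predict,
    tuned_predict: Predict,
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return {slice: (base_acc, tuned_acc)} for slices where the tuned model is worse."""
    base = accuracy_by_slice(examples, base_predict)
    tuned = accuracy_by_slice(examples, tuned_predict)
    return {
        name: (base[name], tuned[name])
        for name in base
        if tuned[name] < base[name] - tolerance
    }
```

Reporting per slice rather than one aggregate number matters here: a large in-distribution slice can hide a regression on a small edge-case slice.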
Journey Context:
The seductive thing about fine-tuning is that it dramatically improves performance on examples that look like your training data, often with 10-20% better adherence to format and style. The trap is that it does this by narrowing the model's effective distribution. A fine-tuned GPT-4o-mini that is great at extracting fields from English invoices can fail badly on German invoices, or on invoices with an unexpected layout, where the base model would have muddled through. The cost implication: you saved money on per-call inference, but you now need either a separate validation pipeline to catch OOD failures or a fallback to the larger model. The degradation signature is distinctive. Fine-tuned models tend to produce confidently wrong outputs (high-probability hallucinations that match the training format but contain fabricated content), whereas the base model's uncertainty signals are usually more obvious. Because these outputs look well-formed, format checks alone will not catch them. Mitigate by including 10-15% diverse or edge-case examples in your fine-tuning set, and always keep the base model available as a fallback.
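One way to sketch the validation-plus-fallback idea for the invoice example, under stated assumptions: the models return a JSON object of string fields, the required field names (`invoice_number`, `total`) are hypothetical, and "grounded" is approximated by requiring each extracted value to appear verbatim in the source document. That check is a cheap heuristic rather than a complete hallucination detector, and `tuned_model` / `base_model` are placeholder callables for your own model clients.

```python
import json
from typing import Callable, Dict, Optional, Sequence, Tuple

REQUIRED_FIELDS = ("invoice_number", "total")  # hypothetical schema

Model = Callable[[str], str]


def parse_fields(
    raw: str, required: Sequence[str] = REQUIRED_FIELDS
) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None if the output is not a JSON object
    containing every required field as a string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(key), str) for key in required):
        return None
    return data


def is_grounded(
    fields: Dict[str, str], document: str, required: Sequence[str] = REQUIRED_FIELDS
) -> bool:
    """Heuristic: every required value is non-empty and appears verbatim in the document."""
    return all(fields[key] and fields[key] in document for key in required)


def extract_with_fallback(
    document: str, tuned_model: Model, base_model: Model
) -> Tuple[Optional[Dict[str, str]], str]:
    """Try the fine-tuned model first; fall back to the base model when the
    output is malformed or contains values not found in the document."""
    fields = parse_fields(tuned_model(document))
    if fields is not None and is_grounded(fields, document):
        return fields, "tuned"
    return parse_fields(base_model(document)), "base"
```

Logging how often the `"base"` branch is taken gives a rough drift signal: a rising fallback rate suggests production inputs are moving away from the fine-tuning data.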
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:00:13.947879+00:00 - report_created - created