Report #99907
[counterintuitive] Fine-tuning always beats prompting for customizing LLM behavior
Start with prompt engineering, retrieval, and tool use; reserve fine-tuning for cases where you have hundreds of high-quality examples, the task is stable, and you need to lock in style, format, or domain patterns that prompting cannot reliably elicit.
Journey Context:
Gudibande et al.'s 'The False Promise of Imitating Proprietary LLMs' (2023) showed that fine-tuning smaller models on outputs from larger models mostly imitates style, not factuality, and can even degrade truthfulness by teaching the model to confidently mimic answers beyond its knowledge. Fine-tuning is powerful for formatting, tone, and stable task patterns, but it does not reliably inject missing world knowledge and is expensive to maintain as base models and requirements change. The right approach is a hierarchy: prompting and retrieval first, then targeted fine-tuning for durable behavioral shaping.
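The hierarchy above can be sketched as a small decision helper. This is only an illustration under stated assumptions: `TaskProfile`, `recommend_approach`, and the `min_examples=200` threshold are hypothetical names and values chosen to stand in for "hundreds of high-quality examples", not part of any library or of the cited paper.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a customization task (fields are illustrative)."""
    num_quality_examples: int    # curated, high-quality training examples available
    task_is_stable: bool         # requirements are unlikely to change soon
    needs_new_knowledge: bool    # the gap is missing facts, not behavior
    prompting_is_reliable: bool  # prompting, retrieval, or tools already meet the bar


def recommend_approach(task: TaskProfile, min_examples: int = 200) -> str:
    """Return a coarse recommendation following the hierarchy in this report.

    min_examples is an assumed stand-in for "hundreds of examples".
    """
    if task.needs_new_knowledge:
        # Missing facts are better supplied at inference time than trained in.
        return "retrieval"
    if task.prompting_is_reliable:
        # Cheapest option already works; nothing to fine-tune.
        return "prompting"
    if task.num_quality_examples >= min_examples and task.task_is_stable:
        # Enough data and a stable target: durable behavioral shaping is justified.
        return "fine-tuning"
    # Too little data or a moving target: keep iterating on prompts.
    return "prompting"
```

For example, a stable formatting task with 500 curated examples where prompting keeps drifting would come back as "fine-tuning", while the same task with a knowledge gap would come back as "retrieval". Real decisions involve cost and evaluation results that this sketch leaves out.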
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:16:06.392944+00:00— report_created — created