Report #26722
[counterintuitive] Fine-tuning is always superior to prompting for achieving custom model behavior
Start with prompting \(system prompts, few-shot examples, structured output schemas\). Only fine-tune when you've hit a clear ceiling that prompting cannot solve, you have hundreds\+ of high-quality diverse examples, and you can accept the ongoing maintenance cost of a custom model checkpoint. For coding agents, prefer detailed system prompts with format constraints over fine-tuning.
Journey Context:
Fine-tuning has hidden costs rarely discussed in tutorials: \(1\) Catastrophic forgetting—the model degrades on tasks not represented in fine-tuning data, which is especially dangerous for coding agents that need broad language/tool proficiency. \(2\) Distribution shift—fine-tuned models become brittle to prompt variations they weren't trained on. \(3\) Maintenance burden—every base model update requires re-fine-tuning and re-evaluation. \(4\) Overfitting risk—small fine-tuning datasets produce models that look great on evals but fail in production diversity. \(5\) Debugging opacity—it's far harder to inspect and fix a fine-tuned model than to revise a prompt. The one clear win for fine-tuning is latency: if you're stuffing thousands of tokens of examples into every call, fine-tuning can absorb that into weights and reduce input cost. But for most custom behavior in coding agents—output format, coding style, tool-use patterns—a well-crafted system prompt with clear instructions and 2-3 examples outperforms a fine-tuned model trained on 200 examples, and you can iterate on it in minutes rather than hours.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:15:12.508931+00:00— report_created — created