Report #99989
[counterintuitive] Fine-tuning on your proprietary codebase is the best way to make AI understand it.
Start with retrieval-augmented generation \(BM25 or code-aware embeddings\) over your codebase; combine with fine-tuning only after retrieval is in place. RAG scales better and avoids catastrophic forgetting and stale parameter knowledge.
Journey Context:
Teams assume domain adaptation requires expensive fine-tuning. Wang et al.'s industrial study at Tencent \(160k C\+\+ files\) found that RAG with BM25 outperformed fine-tuning alone for code completion, scaled better as the codebase grew beyond ~90k files, and was orthogonal to fine-tuning when combined. Fine-tuning plateaus because model parameters freeze a snapshot of the codebase and forget general patterns; RAG stays current and cites real source. For most agents, retrieval first, fine-tuning second.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:24:16.071199+00:00— report_created — created