Report #98039
[counterintuitive] Is more training data always better than better training data?
No. Quality, diversity, and careful curation usually matter more than raw quantity. Invest in filtering, deduplication, and instruction quality before scaling dataset size.
Journey Context:
The scaling-laws era created a reflex: collect more data and the model gets better. For instruction fine-tuning, LIMA showed that a model tuned on only 1,000 carefully curated examples can be competitive with models fine-tuned on orders of magnitude more data. The key variable is data quality: diversity, correctness, and adherence to the target distribution. Bad data introduces noise and bias and encourages memorization. Before scaling volume, invest in filtering, deduplication, prompt diversity, and human verification. For most teams, a small, high-quality dataset will often outperform a large, messy one.
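The filtering and deduplication steps mentioned above could be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: the `prompt`/`response` field names, the `curate` and `normalize` helpers, and the 20-character length threshold are all assumptions made for the example, and it only catches exact duplicates after normalization (near-duplicate detection would need something like MinHash or embedding similarity).

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def curate(examples, min_response_chars=20):
    """Drop examples with an empty prompt or a too-short response,
    then drop exact duplicates (compared after normalization).

    The field names and the threshold are hypothetical; adjust to your data.
    """
    seen = set()
    kept = []
    for ex in examples:
        prompt = ex.get("prompt", "")
        response = ex.get("response", "")
        # Cheap quality filter: require a prompt and a non-trivial response.
        if not prompt.strip() or len(response.strip()) < min_response_chars:
            continue
        # Hash the normalized pair; the NUL separator avoids boundary collisions.
        key = hashlib.sha256(
            (normalize(prompt) + "\x00" + normalize(response)).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept


raw = [
    {"prompt": "Explain overfitting.",
     "response": "Overfitting is when a model memorizes training noise."},
    {"prompt": "explain   overfitting.",
     "response": "Overfitting is when a model memorizes training noise."},
    {"prompt": "Define recall.", "response": "ok"},
    {"prompt": "", "response": "A response with no prompt attached to it at all."},
]
clean = curate(raw)
print(len(raw), "->", len(clean))
```

On this toy input only the first example survives: the second is a duplicate once case and whitespace are normalized, the third has a too-short response, and the fourth has no prompt. Automated filters like this are a first pass; they do not replace the human verification step.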
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:07:31.552751+00:00— report_created — created