Report #99528
[counterintuitive] Retraining or fine-tuning on AI-generated code closes the capability gap.
Curate high-quality human data, and add synthetic data only after strong filtering; unfiltered generated-data loops degrade tail performance, amplify errors, and can lead to model collapse.
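A minimal sketch of the "filter, then mix" idea, under stated assumptions: each synthetic sample is a Python snippet, and an independent check (here a caller-supplied function) decides whether it is kept. The names `passes_check`, `build_training_mix`, and `synthetic_per_human` are hypothetical and come from no particular library. The `exec` call is for illustration only; a real pipeline would need a sandbox with timeouts.

```python
import random
from typing import Callable, Dict, List


def passes_check(source: str, check: Callable[[Dict], bool]) -> bool:
    """Run a candidate snippet in a fresh namespace, then apply an independent check.

    Any exception (syntax error, missing name, failing check) rejects the sample.
    WARNING: exec() on untrusted code is unsafe; sandbox it in practice.
    """
    namespace: Dict = {}
    try:
        exec(source, namespace)
        return bool(check(namespace))
    except Exception:
        return False


def build_training_mix(
    human: List[str],
    synthetic: List[str],
    check: Callable[[Dict], bool],
    synthetic_per_human: float = 0.5,  # assumed cap, not a recommended value
    seed: int = 0,
) -> List[str]:
    """Keep all human samples and add at most a capped number of verified synthetic ones."""
    verified = [s for s in synthetic if passes_check(s, check)]
    cap = int(len(human) * synthetic_per_human)
    rng = random.Random(seed)
    rng.shuffle(verified)
    mix = list(human) + verified[:cap]
    rng.shuffle(mix)
    return mix
```

The cap keeps verified synthetic data from outnumbering human data; the right ratio depends on the task and is not specified by this report.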
Journey Context:
Model-collapse research shows that training on generated output progressively reduces variance and erodes low-probability but important behaviors. Code-generation pipelines that feed their own outputs back into training without validation risk propagating bugs and losing robustness. Synthetic data can help only when it is independently verified and mixed with real data.
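The loss of low-probability behaviors can be illustrated with a toy simulation. This is a sketch, not a reproduction of any published experiment: the "model" is just a categorical distribution that is repeatedly refit to a finite sample of its own output. The names `resample_generation` and `simulate_collapse`, and the default sample sizes, are assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def resample_generation(
    probs: Dict[str, float], n_samples: int, rng: random.Random
) -> Dict[str, float]:
    """Sample from the current distribution, then refit it as empirical frequencies."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {t: counts[t] / n_samples for t in tokens}


def simulate_collapse(
    probs: Dict[str, float],
    n_samples: int = 50,
    generations: int = 30,
    seed: int = 0,
) -> List[int]:
    """Return the number of tokens with non-zero probability after each generation."""
    rng = random.Random(seed)
    support_sizes = [sum(p > 0 for p in probs.values())]
    for _ in range(generations):
        probs = resample_generation(probs, n_samples, rng)
        support_sizes.append(sum(p > 0 for p in probs.values()))
    return support_sizes


if __name__ == "__main__":
    start = {"common": 0.90, "usual": 0.06, "rare_a": 0.02, "rare_b": 0.01, "rare_c": 0.01}
    print(simulate_collapse(start))
```

Once a rare token fails to appear in one generation's sample, its refit probability is zero and it cannot come back, so the support size never grows. Mixing in fresh real data each generation is what breaks this one-way loss.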
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:17:26.571674+00:00— report_created — created