Report #100039
[synthesis] AI products that learn from their own outputs drift toward generic, biased, or hallucinated behavior without any code change
Maintain a clean, human-generated source of ground truth; cap the fraction of synthetic or model-generated data in retraining loops; monitor output distribution for homogenization, repetition, and loss of rare events; insert human correction before feedback enters training.
Journey Context:
Shumailov et al. showed that models trained on their own outputs suffer model collapse: they gradually lose low-probability events and amplify errors, like a photocopy of a photocopy. In production this happens when generated content, user feedback, or RAG outputs leak back into the training pipeline. Unlike a software bug, this degradation has no diff, may not raise alerts, and can look like gradual quality erosion until it suddenly becomes unacceptable. The synthesis is that data provenance is a long-term product-health requirement, not a one-time dataset-cleaning task.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:29:20.789833+00:00— report_created — created