Report #97548
[gotcha] Models emit verbatim private training data when probed with prefixes or repetition
Minimize sensitive data in pretraining and fine-tuning corpora; deduplicate data; monitor outputs for memorization and PII; apply differential privacy for sensitive fine-tuning; implement output filters that detect near-verbatim regurgitation.
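As an illustration of the last recommendation, here is a minimal sketch of an output filter that flags near-verbatim regurgitation by measuring word n-gram overlap between a model response and a set of protected documents. The function names (`build_index`, `regurgitation_score`, `should_block`), the n-gram size, and the threshold are all hypothetical choices for this sketch, not part of any library; a production filter would more likely use the model's own tokenizer and a compact index such as a Bloom filter or suffix array.

```python
import re


def _ngrams(text, n):
    """Return the set of lowercase word n-grams in text (punctuation ignored)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(protected_docs, n=8):
    """Collect every word n-gram that appears in the protected documents."""
    index = set()
    for doc in protected_docs:
        index |= _ngrams(doc, n)
    return index


def regurgitation_score(output, index, n=8):
    """Fraction of the output's n-grams that also occur in the protected index."""
    grams = _ngrams(output, n)
    if not grams:
        # Output shorter than n words: this check cannot say anything about it.
        return 0.0
    return len(grams & index) / len(grams)


def should_block(output, index, n=8, threshold=0.2):
    """Assumed policy: block when the overlap fraction reaches the threshold."""
    return regurgitation_score(output, index, n) >= threshold
```

Note that `n` must match between `build_index` and the scoring call. Because the score is 0.0 for outputs shorter than `n` words, this check does not catch short secrets such as a single phone number; those need a separate PII detector.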
Journey Context:
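Nasr et al. showed that gigabytes of memorized training text can be extracted from open and production models; against an aligned chat model they used a divergence attack, in which asking the model to repeat a word indefinitely makes it drift out of chat behavior and emit training data. LLMs do not just learn patterns; they also memorize exact sequences, especially repeated or unique ones. The risk is highest when models are fine-tuned on private documents. No prompt-level defense can fully prevent extraction if the data is in the weights; output monitoring and filtering reduce exposure, but the fix must be at the data and training layer.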
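Prefix probing of this kind can also be turned around and used as an audit before release. The sketch below feeds the opening of each private document to the model and measures how much of the true continuation comes back verbatim. It assumes a hypothetical `generate(prompt) -> str` callable wrapping whatever model is under test, and it uses whitespace-separated words as a stand-in for real tokens; both are illustrative assumptions, not a specific library's API.

```python
def common_prefix_len(a, b):
    """Number of leading items on which sequences a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def audit_memorization(generate, documents, prefix_tokens=50, suffix_tokens=50):
    """Probe a model with document prefixes and score verbatim continuation.

    generate: hypothetical callable, prompt string -> continuation string.
    Returns a list of (document_index, fraction_of_suffix_reproduced).
    """
    findings = []
    for i, doc in enumerate(documents):
        tokens = doc.split()  # whitespace words stand in for model tokens
        if len(tokens) < prefix_tokens + suffix_tokens:
            continue  # document too short to probe at these lengths
        prefix = " ".join(tokens[:prefix_tokens])
        true_suffix = tokens[prefix_tokens:prefix_tokens + suffix_tokens]
        continuation = generate(prefix).split()
        matched = common_prefix_len(continuation, true_suffix)
        findings.append((i, matched / suffix_tokens))
    return findings
```

A fraction near 1.0 for a document suggests it is memorized and extractable from its prefix alone. A low fraction is weaker evidence: other prompts or sampling settings may still surface the same text.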
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:18:12.579493+00:00— report_created — created