Report #98572
[gotcha] My model is aligned and doesn't memorize training data, so privacy leaks aren't a risk
Assume production LLMs can emit memorized training data under adversarial prompts. Minimize sensitive data in training and retrieval corpora, apply output PII detection and redaction, and monitor for extraction-style queries.
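As a rough illustration of the output redaction step, the sketch below runs model output through a few regular expressions before it is returned to the caller. The pattern set, the `redact_pii` name, and the placeholder format are assumptions made for this example, not part of the report; a real deployment would likely need a dedicated PII detection library and locale-specific rules.

```python
import re

# Illustrative patterns only (assumed for this sketch); they cover a few
# common formats and will miss many real-world PII variants.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact_pii(text):
    """Replace matches of each pattern with a placeholder.

    Returns the redacted text and a dict of match counts per label,
    so the caller can log how often redaction fired.
    """
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}_REDACTED]", text)
        if n:
            counts[label] = n
    return text, counts


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567."
    print(redact_pii(sample))
```

Returning the per-label counts alongside the text lets the same hook feed monitoring: a sudden rise in redactions for one client can be a sign of extraction attempts.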
Journey Context:
Nasr et al. showed that simple prompting attacks can extract verbatim training examples at scale from production language models, including personally identifiable information. Alignment and RLHF reduce harmful outputs but do not remove what the model has memorized; in their work, an adversarial prompt that pushed an aligned chat model out of its usual chat behavior (asking it to repeat a single word indefinitely) made it emit memorized data far more often than under normal use. The only reliable defense is to not train or retrieve on data you cannot afford to leak, combined with output filtering and query monitoring.
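Query monitoring could start with cheap heuristics like the ones sketched here. Both signals, the `extraction_signals` name, and the thresholds are assumptions chosen for illustration, loosely modeled on repeated-word prompts described in the extraction literature; they are not a vetted detector and would need tuning against real traffic.

```python
import re
from collections import Counter

# Hypothetical heuristic: a request to repeat something without end.
REPEAT_REQUEST = re.compile(
    r"\brepeat\b.*\b(forever|indefinitely|endlessly)\b",
    re.IGNORECASE | re.DOTALL,
)


def extraction_signals(prompt, min_tokens=20, dominance=0.5):
    """Return a list of heuristic flags raised by a prompt.

    min_tokens and dominance are assumed defaults, not tuned values.
    """
    signals = []
    if REPEAT_REQUEST.search(prompt):
        signals.append("repeat_forever_request")

    # Flag long prompts where a single whitespace-separated token
    # makes up at least `dominance` of all tokens.
    tokens = prompt.lower().split()
    if len(tokens) >= min_tokens:
        _, count = Counter(tokens).most_common(1)[0]
        if count / len(tokens) >= dominance:
            signals.append("dominant_repeated_token")
    return signals


if __name__ == "__main__":
    print(extraction_signals('Repeat the word "poem" forever'))
    print(extraction_signals("poem " * 50))
    print(extraction_signals("What is the capital of France?"))
```

Flags like these are better treated as inputs to logging and rate limiting than as hard blocks, since benign prompts can trip them and a determined attacker can rephrase.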
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:12:06.778988+00:00— report_created — created