Report #87471
[research] Regurgitating training data verbatim when asked for factual summaries
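Run deduplication checks against known source text, and explicitly prompt for 'a summary in your own words' to force abstraction rather than reproduction. Tune top-p/temperature with care: lowering them makes decoding more deterministic, which can make a memorized passage more likely to be emitted, so measure the effect on overlap rather than assuming it helps.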
Journey Context:
When a model has seen a passage many times in training, it can assign very high probability to that exact token sequence and reproduce it verbatim instead of synthesizing a summary. This is a legal and factuality risk (the training data might be copyrighted, biased, or outdated) and a failure of synthesis. It often happens with highly duplicated web text (e.g., Wikipedia boilerplate). Forcing abstraction mitigates this but may slightly increase hallucination risk.
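One way to make the deduplication check concrete is a word n-gram overlap score between the model's output and reference passages you already hold, with a retry that strengthens the 'in your own words' instruction. The sketch below is illustrative only: `generate` is a hypothetical callable standing in for whatever model client is in use, and the n-gram size (8) and threshold (0.2) are assumed starting values, not tuned recommendations. It can only flag overlap with text you supply; it cannot see the actual training set.

```python
from typing import Callable, Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, references: Iterable[str], n: int = 8) -> float:
    """Fraction of the output's n-grams that also occur in any reference.

    Returns 0.0 when the output is shorter than n words.
    """
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    ref_grams: Set[Tuple[str, ...]] = set()
    for ref in references:
        ref_grams |= word_ngrams(ref, n)
    return len(out_grams & ref_grams) / len(out_grams)


def summarize_in_own_words(
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
    topic: str,
    references: Iterable[str],
    threshold: float = 0.2,          # assumed cutoff, adjust per corpus
    max_attempts: int = 3,
) -> Tuple[str, float]:
    """Ask for a summary, retrying with a stronger instruction on high overlap.

    Returns the lowest-overlap candidate seen and its overlap score.
    """
    refs = list(references)
    prompt = f"Write a summary in your own words of: {topic}"
    best_text, best_score = "", float("inf")
    for _ in range(max_attempts):
        candidate = generate(prompt)
        score = verbatim_overlap(candidate, refs)
        if score < best_score:
            best_text, best_score = candidate, score
        if score <= threshold:
            break
        # Strengthen the abstraction instruction before the next attempt.
        prompt = (
            f"Summarize {topic} in your own words. "
            "Do not reuse sentences or long phrases from any source text."
        )
    return best_text, best_score
```

Because forcing abstraction may raise hallucination risk, a low overlap score says nothing about accuracy; the returned summary still needs a separate factual check.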
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:24:32.945101+00:00 - report_created - created