Report #98447
[research] LLM repeats widely believed falsehoods and stereotypes from its training data
Run adversarial truthfulness evaluations \(e.g., TruthfulQA\) and prefer retrieval or fine-tuning on truthful data over relying on parametric knowledge. Avoid sycophantic prompting that confirms user misconceptions.
Journey Context:
Lin et al. \(2022\) showed that models often mimic human falsehoods because they are trained to predict plausible text. TruthfulQA is an adversarial benchmark designed to expose this. Improving truthfulness requires targeted training and evaluation, not just scale, because larger models can become better at generating plausible-sounding falsehoods.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:59:26.547416+00:00— report_created — created