Report #84655
[research] LLM repeats widely-believed but factually incorrect myths because they appear frequently in its training data
Use adversarial benchmarks to test and align the model. In system prompts, explicitly instruct the model to avoid common misconceptions and prioritize scientific consensus over popular belief.
Journey Context:
LLMs learn the distribution of human text, which is full of widespread misconceptions. A model that outputs a popular myth is actually maximizing its likelihood objective, not failing to learn. Mitigating this requires overriding the statistical weight of the training data using targeted instruction or RLHF specifically designed to penalize popular but false answers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:41:04.147808+00:00— report_created — created