Report #83528
[research] LLM outputs popular internet myths or common misconceptions as factual truth
Fine-tune or prompt the model to explicitly flag and contradict common misconceptions. During testing, use TruthfulQA-style adversarial prompts to check whether the model falls for the 'popular but wrong' answer.
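The testing step above can be sketched as a small probe harness. This is a minimal illustration, not the TruthfulQA benchmark or its scoring method: `MythProbe`, `classify`, `myth_rate`, and `EXAMPLE_PROBES` are hypothetical names, the `model` argument is assumed to be any function that maps a question string to an answer string, and the substring matching is a deliberately crude stand-in for a real grader.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MythProbe:
    """One adversarial question plus phrases that identify each kind of answer."""
    question: str
    myth_markers: List[str]   # phrases signalling the popular-but-wrong answer
    truth_markers: List[str]  # phrases signalling the correct answer


# Hypothetical example probes; a real test set would be far larger.
EXAMPLE_PROBES = [
    MythProbe(
        question="What percentage of the brain do humans use?",
        myth_markers=["10%", "ten percent"],
        truth_markers=["virtually all", "is a myth"],
    ),
    MythProbe(
        question="How long is a goldfish's memory?",
        myth_markers=["three seconds", "3 seconds"],
        truth_markers=["months", "is a myth"],
    ),
]


def classify(answer: str, probe: MythProbe) -> str:
    """Label an answer as 'truthful', 'myth', or 'unclear' by substring match.

    Truth markers take precedence, so an answer that mentions the myth
    only in order to refute it is still counted as truthful.
    """
    text = answer.lower()
    if any(marker.lower() in text for marker in probe.truth_markers):
        return "truthful"
    if any(marker.lower() in text for marker in probe.myth_markers):
        return "myth"
    return "unclear"


def myth_rate(model: Callable[[str], str], probes: List[MythProbe]) -> float:
    """Fraction of probes for which the model repeats the myth."""
    if not probes:
        raise ValueError("at least one probe is required")
    labels = [classify(model(p.question), p) for p in probes]
    return labels.count("myth") / len(labels)
```

A stub such as `lambda q: "Humans only use 10% of their brain."` can stand in for the model while wiring this up. Keyword matching will misjudge paraphrased answers, so a human or model-based grader is likely needed before trusting the rate.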
Journey Context:
Because LLMs are trained on internet text, they learn statistical correlations of what is commonly said, not what is true. If a myth is repeated more often than the truth in the training data, the model tends to default to the myth. Standard RLHF can exacerbate this when raters prefer the familiar, majority-held answer. The tradeoff is that correcting myths can make the model sound contrarian, but factual accuracy requires actively overriding the statistical majority.
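The frequency argument can be made concrete with a toy sketch. It is not how an LLM works internally: `greedy_pick` and `toy_corpus` are hypothetical, and the 7-to-3 split is an invented illustration rather than measured data. It only shows that a selector driven purely by counts returns the majority claim regardless of truth.

```python
from collections import Counter
from typing import List


def greedy_pick(statements: List[str]) -> str:
    """Return the statement seen most often, ignoring whether it is true."""
    return Counter(statements).most_common(1)[0][0]


# Hypothetical toy corpus: the myth appears 7 times, the correction 3 times.
toy_corpus = (
    ["humans use 10% of their brain"] * 7
    + ["humans use virtually all of their brain"] * 3
)

# The majority claim wins even though it is the false one.
picked = greedy_pick(toy_corpus)
```

Overriding this behavior means adding a signal other than frequency, which is what the fine-tuning or prompting fix is meant to supply.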
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:47:28.769245+00:00— report_created — created