Report #54700
[research] LLM outputs a common misconception (e.g., 'bats are blind') as fact due to training data bias favoring statistically common but false statements
Evaluate against TruthfulQA; add specific negative constraints in the system prompt for known high-frequency myths relevant to the domain.
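A minimal sketch of the negative-constraint idea, assuming a hand-maintained list of domain myths. The names `Myth`, `KNOWN_MYTHS`, `build_system_prompt`, and `repeats_myth` are hypothetical and not part of any library, and the substring check is only a crude regression guard, not a substitute for a TruthfulQA run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Myth:
    claim: str       # the false statement the model tends to repeat
    correction: str  # the accurate statement to prefer


# Hypothetical example entries; replace with myths from your own domain.
KNOWN_MYTHS = [
    Myth("Bats are blind.",
         "Bats can see; many species also use echolocation."),
    Myth("Humans use only 10% of their brains.",
         "Humans use virtually all of their brain."),
]


def build_system_prompt(base: str, myths: list[Myth]) -> str:
    """Append one explicit negative constraint per known myth."""
    lines = [base.strip(), "",
             "Do not state any of the following misconceptions as fact:"]
    for m in myths:
        lines.append(f"- Do NOT claim: {m.claim} Correct: {m.correction}")
    return "\n".join(lines)


def repeats_myth(output: str, myths: list[Myth]) -> list[Myth]:
    """Return myths whose claim text appears verbatim in the output.

    Limitation: this is a case-insensitive substring match, so it also
    flags outputs that quote a myth in order to debunk it.
    """
    text = output.lower()
    return [m for m in myths if m.claim.lower().rstrip(".") in text]
```

The prompt string can be passed as the system message to whichever model client is in use, and `repeats_myth` can run over sampled outputs as a cheap smoke test between full evaluations.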
Journey Context:
LLMs learn statistical correlations, not truth. If a misconception appears more frequently than the correction in the training corpus, the model will confidently output the myth. Standard RLHF might even reinforce this if human raters also hold the misconception. Targeted adversarial prompting or specialized fine-tuning on truth-correction pairs is required.
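The truth-correction pairs mentioned above could be stored as preference-style records, sketched below. The `prompt`/`chosen`/`rejected` field names are an assumption borrowed from common preference-tuning datasets, and `to_training_records` and `to_jsonl` are hypothetical helpers; adapt the schema to whatever your fine-tuning pipeline expects.

```python
import json


def to_training_records(pairs):
    """Convert (question, myth, correction) tuples into preference records.

    The correction is the preferred answer and the myth is the rejected one.
    """
    return [
        {"prompt": question, "chosen": correction, "rejected": myth}
        for question, myth, correction in pairs
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical example pair; a real dataset would need many per myth.
EXAMPLE_PAIRS = [
    ("Are bats blind?",
     "Yes, bats are blind.",
     "No. Bats can see, and many species also use echolocation."),
]
```

Keeping the myth as the explicit rejected answer means the same file can also serve as an adversarial test set: prompt with `prompt` and check that the output is closer to `chosen` than to `rejected`.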
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:18:41.300971+00:00 - report_created - created