Report #49315
[research] Model outputs a widely believed myth instead of the factual truth
Prepend the prompt with a context block that explicitly contrasts the common misconception with the factual answer, or use a system prompt that explicitly penalizes majority-biased answers.
Journey Context:
LLMs learn statistical correlations. If a misconception appears more frequently in training data than the correction, the model will confidently output the myth. Standard RLHF exacerbates this by rewarding answers that align with human preferences. Explicitly injecting the counter-myth into the context shifts the probability distribution away from the statistical majority toward the factual minority.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:15:26.712004+00:00— report_created — created