Report #28842
[research] LLM outputs confident answers that match common human misconceptions instead of facts
Use specialized adversarial datasets (like TruthfulQA) to benchmark the model and to guide alignment. For generation, avoid relying on zero-shot parametric memory for factual trivia; add a verified retrieval step for questions prone to common myths, and ground the answer in the retrieved evidence.
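A minimal sketch of the retrieval step described above, assuming a small curated fact store. The names `VERIFIED_FACTS`, `retrieve_verified`, and `answer` are hypothetical, and `model` stands in for whatever text-generation callable is in use; none of these come from a specific library.

```python
from typing import Callable, Optional

# Hypothetical verified fact store. In practice this would be a curated
# knowledge base or a retrieval index with source attribution.
VERIFIED_FACTS = {
    "are bats blind": "Bats are not blind; they can see.",
}


def normalize(question: str) -> str:
    """Lowercase and drop punctuation so lookups are less brittle."""
    kept = "".join(ch for ch in question.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def retrieve_verified(question: str) -> Optional[str]:
    """Return verified evidence for the question, or None if there is none."""
    return VERIFIED_FACTS.get(normalize(question))


def answer(question: str, model: Callable[[str], str]) -> str:
    """Answer only when verified evidence exists; otherwise decline."""
    evidence = retrieve_verified(question)
    if evidence is None:
        # Do not fall back to the model's parametric memory.
        return "No verified source found; declining to answer from memory."
    prompt = (
        "Answer using only this evidence.\n"
        f"Evidence: {evidence}\n"
        f"Question: {question}"
    )
    return model(prompt)
```

The design choice here is that the model is never called without evidence in the prompt, so a myth stored in the weights has no path to the output unless it also appears in the verified store. An exact-match lookup is used only to keep the sketch short; a real system would likely need fuzzy or embedding-based retrieval.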
Journey Context:
LLMs mimic human text, including widely repeated human errors (e.g., 'bats are blind'). RLHF can exacerbate this by rewarding answers that sound like typical (but wrong) human responses, because human annotators often share the misconception. Overcoming this requires explicit fact-checking pipelines rather than reliance on the model's internal weights, since those weights reflect how often the myth appears in the training text, not whether it is true.
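One way to check whether a model reproduces a myth is to score its answers against a small set of myth-prone questions. The sketch below is an assumption-laden illustration: `MythItem`, `classify`, and `myth_rate` are hypothetical names, the item layout is not the actual TruthfulQA schema, and substring matching is a crude stand-in for a proper truthfulness judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MythItem:
    question: str
    truthful_markers: List[str]  # lowercase phrases expected in a truthful answer
    myth_markers: List[str]      # lowercase phrases that signal the misconception


def classify(answer: str, item: MythItem) -> str:
    """Label an answer 'truthful', 'myth', or 'unknown' by substring match."""
    text = answer.lower()
    if any(t in text for t in item.truthful_markers):
        return "truthful"
    if any(m in text for m in item.myth_markers):
        return "myth"
    return "unknown"


def myth_rate(items: List[MythItem], model: Callable[[str], str]) -> float:
    """Fraction of items whose answer matches the misconception."""
    if not items:
        return 0.0
    myths = sum(1 for item in items if classify(model(item.question), item) == "myth")
    return myths / len(items)
```

For example, with a single item whose truthful marker is "not blind" and whose myth marker is "are blind", a model that answers "Yes, bats are blind." scores a myth rate of 1.0, while one that answers "Bats are not blind." scores 0.0. Tracking this rate before and after adding a retrieval step gives a rough signal of whether the fix is working.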
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:48:25.733490+00:00— report_created — created