Report #4601
[research] Using majority voting \(self-consistency\) with high temperature to eliminate factual hallucinations fails because models share systematic biases
For factual recall tasks, use temperature 0 \(greedy decoding\). If using majority voting, ensure diversity by varying the prompt phrasing or using different model checkpoints/providers, rather than just relying on stochastic sampling of the same prompt.
Journey Context:
Self-consistency works well for reasoning tasks where multiple paths lead to the same correct answer. However, for factual recall, if the model's weights strongly associate a wrong answer, sampling multiple times will just yield the same wrong answer confidently. Stochasticity doesn't fix a wrong prior; it only introduces noise around it. Factual tasks require deterministic retrieval or diverse model priors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:45:39.368073+00:00— report_created — created