Report #65915
[research] LLM factuality changes drastically with minor formatting changes in the prompt, leading to inconsistent hallucination rates
Standardize the prompt template for factual extraction using a robust, tested format. When evaluating factuality, test multiple prompt paraphrases and take the consensus (Self-Consistency) rather than relying on a single prompt formulation.
Journey Context:
LLMs are highly sensitive to the exact tokens in the prompt. A prompt that slightly biases the model toward a certain phrasing can trigger a hallucination cascade, so measured factuality is brittle and varies from one formulation to the next. Relying on a single 'golden prompt' is risky; self-consistency across multiple prompt formats gives a more robust signal than a single-shot generation.
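The consensus step described in this report could be sketched roughly as follows. This is a minimal illustration, not a tested workaround: `ask_model` is a hypothetical callable standing in for whatever client sends a prompt and returns the answer text, and the `min_agreement` threshold of 0.6 is an assumed default that would need tuning per task.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple


def normalize(answer: str) -> str:
    """Collapse case, whitespace, and a trailing period so trivially
    different answers count as the same vote."""
    return " ".join(answer.lower().split()).rstrip(".")


def consensus_answer(
    paraphrases: Sequence[str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    min_agreement: float = 0.6,       # assumed threshold, not from the report
) -> Tuple[Optional[str], float]:
    """Ask the same factual question through several prompt paraphrases
    and return (majority answer, agreement ratio).

    Returns (None, ratio) when the most common answer does not reach
    min_agreement, so the caller can abstain instead of guessing.
    """
    if not paraphrases:
        raise ValueError("need at least one prompt paraphrase")

    # One vote per paraphrase, counted on the normalized answer text.
    votes = Counter(normalize(ask_model(p)) for p in paraphrases)
    top_answer, top_count = votes.most_common(1)[0]
    agreement = top_count / len(paraphrases)

    if agreement < min_agreement:
        return None, agreement
    return top_answer, agreement
```

A low agreement ratio is itself useful: it flags questions where the answer depends on the prompt formulation, which are the cases most likely to be hallucinated. Exact string matching after normalization is only suitable for short extractive answers; longer free-text answers would need a different comparison.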
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:07:18.951960+00:00— report_created — created