Report #3196
[research] Token probability, softmax entropy, and lexical overlap are weak or unavailable signals for hallucination in black-box models.
Use semantic consistency across multiple sampled responses (SelfCheckGPT) or semantic entropy over meaning-equivalence clusters to detect hallucinations without access to model internals. These methods measure whether the model's answer is stable in meaning, not just in wording.
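A minimal sketch of the sampling-based consistency check, assuming the main answer has already been split into claims and several extra responses have been sampled for the same prompt. The names `inconsistency_scores` and `toy_supports` are hypothetical and not taken from the SelfCheckGPT code; `toy_supports` is a substring stand-in for whatever support checker is actually used (an NLI model or an LLM prompt, for example).

```python
from typing import Callable, List


def inconsistency_scores(
    claims: List[str],
    samples: List[str],
    supports: Callable[[str, str], bool],
) -> List[float]:
    """For each claim, return the fraction of samples that do NOT support it.

    0.0 means every sampled response backs the claim; 1.0 means none do.
    `supports(sample, claim)` is a pluggable, hypothetical checker.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    scores = []
    for claim in claims:
        unsupported = sum(1 for s in samples if not supports(s, claim))
        scores.append(unsupported / len(samples))
    return scores


def toy_supports(sample: str, claim: str) -> bool:
    # Toy stand-in only: a real checker should judge meaning, not substrings.
    return claim.lower() in sample.lower()


# Illustrative data: the first claim recurs in every sample, the second never does.
claims = ["the service listens on port 8080", "it was released in 2011"]
samples = [
    "By default the service listens on port 8080.",
    "The service listens on port 8080 unless overridden.",
    "I believe the service listens on port 8080.",
]
scores = inconsistency_scores(claims, samples, toy_supports)
```

A claim with a high score is one the model does not reproduce when asked again, which is the signal treated as a likely hallucination. The threshold for flagging is a tuning choice and is left out here.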
Journey Context:
SelfCheckGPT established that black-box hallucination detection is possible by sampling several answers to the same prompt and checking whether each claim is semantically supported by the others. The semantic-entropy line of work formalized this by clustering generations with bidirectional NLI entailment and computing entropy in meaning space. This avoids the pitfalls of token probabilities: different correct paraphrases look high-entropy lexically, while a confidently repeated hallucination can look low-entropy.
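The clustering-then-entropy step can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: cluster probabilities are estimated from sample counts (an assumption made here because sequence likelihoods are unavailable in the black-box setting), and `toy_equivalent` with its `MEANING` table is a hypothetical stand-in for a bidirectional NLI entailment check.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str], equivalent: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers: each joins the first cluster whose
    representative (first member) it is equivalent to."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str], equivalent: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) over meaning clusters, using cluster frequencies."""
    if not answers:
        raise ValueError("need at least one answer")
    clusters = cluster_by_meaning(answers, equivalent)
    probs = [len(c) / len(answers) for c in clusters]
    return sum(-p * math.log(p) for p in probs)


# Hypothetical hand-labelled meanings; a real system would call an NLI model
# in both directions (a entails b AND b entails a) instead of this lookup.
MEANING = {
    "It is 4.": "4",
    "Four.": "4",
    "The answer is four.": "4",
    "It is 5.": "5",
}


def toy_equivalent(a: str, b: str) -> bool:
    return MEANING[a] == MEANING[b]


paraphrases = ["It is 4.", "Four.", "The answer is four."]
mixed = ["It is 4.", "It is 5.", "Four.", "It is 5."]

stable = semantic_entropy(paraphrases, toy_equivalent)    # one cluster
unstable = semantic_entropy(mixed, toy_equivalent)        # two equal clusters
```

The three paraphrases share almost no wording yet fall into a single cluster, so their entropy in meaning space is zero, while the mixed set splits into two clusters and scores higher.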
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:40:38.107127+00:00— report_created — created