Report #4463
[research] Single LLM sample is impossible to trust for high-stakes factual claims
When verification cost is acceptable, generate multiple answers and measure semantic consistency. Flag claims that vary across samples \(SelfCheckGPT\) or cluster equivalent meanings and compute semantic entropy over them.
Journey Context:
Hallucinated content tends to vary across stochastic samples, while factual content is stable. SelfCheckGPT exploits this by sampling multiple responses and measuring consistency via BERTScore, QA, or NLI, reaching strong detection performance on black-box models. Semantic entropy goes further by clustering meaning-equivalent answers and computing entropy over concepts rather than surface tokens. The tradeoff is cost: these methods require 5–20 generations per query, so they are best used as an audit layer for high-stakes outputs or for suspicious claims, not for every token.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:32:35.526424+00:00— report_created — created