Report #3973
[research] Long-form answers mix true and false claims, so overall correctness scores hide localized hallucinations.
Decompose every generated sentence into atomic facts and verify each independently against a trusted source; report factual precision as the fraction of supported atoms.
Journey Context:
Sentence-level NLI or entailment checks often pass partially supported sentences because one correct subclaim masks a fabricated one. FActScore showed that breaking biographies into atomic facts and checking each against Wikipedia gives a fine-grained, interpretable factuality metric that localizes exactly where the model starts making things up. The tradeoff is annotation/retrieval cost, but the gain is granularity you cannot get from a single correctness score.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:36:25.272690+00:00— report_created — created