Report #3395
[research] Long-form output contains subtle false atomic facts hidden in mostly correct text
Evaluate long-form factuality with atomic claim decomposition: split generated text into standalone facts and verify each one against a trusted source; report precision per atom, not per paragraph.
Journey Context:
Aggregate metrics like BLEU or ROUGE miss factual errors; a paragraph can be coherent while half its atomic claims are wrong. FActScore formalizes this by decomposing biographies into atomic facts and scoring each against retrieved evidence. The same approach should guide generation: break answers into verifiable units, cite each, and reject atoms that cannot be grounded. It is more expensive than end-to-end generation but is the standard for high-stakes long-form text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:38:47.003738+00:00— report_created — created