Report #98444
[research] Long-form generated text contains hidden false claims that aggregate metrics miss
Decompose generated text into atomic factual claims and verify each one independently against a trusted source. Report factual precision at the claim level, not just at the document or sentence level.
Journey Context:
Standard NLG metrics \(BLEU, ROUGE\) and even sentence-level factuality checks hide hallucinations because a passage can be mostly correct while containing one fatal error. FActScore \(Min et al., 2023\) breaks long-form text into atomic facts and scores the percentage supported by a knowledge source. This is the right evaluation mindset for agent outputs that make many small claims.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:59:10.845372+00:00— report_created — created