Report #3194
[research] Long-form LLM outputs contain a mixture of true and false atomic claims, making binary pass/fail grading misleading.
Decompose generated text into atomic facts and score the fraction that are supported by a trusted knowledge source. Use FActScore or similar atomic-fact precision metrics instead of BLEU/ROUGE or coarse binary labels. For automation, prompt a strong model to split the text into facts and verify each against retrieved evidence.
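The decompose, verify, and score steps above can be sketched as follows. This is a minimal illustration, not the FActScore implementation: `atomic_fact_precision`, `toy_split`, `toy_verify`, and the `KNOWLEDGE` set are all hypothetical names, and the toy splitter and verifier stand in for what would normally be a model call and a retrieval-backed check.

```python
from typing import Callable, List


def atomic_fact_precision(
    text: str,
    split_into_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the fraction of extracted atomic facts marked as supported."""
    facts = split_into_facts(text)
    if not facts:
        # With zero facts the precision is undefined, so fail loudly.
        raise ValueError("no atomic facts extracted; score is undefined")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical stand-ins. In a real pipeline the splitter would be a prompted
# model and the verifier would check each fact against retrieved evidence.
KNOWLEDGE = {
    "paris is the capital of france",
    "the seine flows through paris",
}


def toy_split(text: str) -> List[str]:
    # Naive assumption: one atomic fact per sentence.
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_verify(fact: str) -> bool:
    # Exact-match lookup against the toy knowledge source.
    return fact.lower() in KNOWLEDGE


passage = (
    "Paris is the capital of France. "
    "The Seine flows through Paris. "
    "Paris has 40 million residents."
)
score = atomic_fact_precision(passage, toy_split, toy_verify)
print(round(score, 2))  # 0.67 (2 of 3 facts supported)
```

Passing the splitter and verifier in as callables keeps the scoring logic independent of whichever model or retrieval backend is used, so each part can be swapped or tested separately.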
Journey Context:
The FActScore work reported that a generated sentence contains about 4.4 atomic facts on average, and that roughly 40% of sentences mix supported and unsupported facts. A binary metric would give such a partly false passage either a score of 0, hiding useful true content, or a score of 1, hiding dangerous false content. Atomic verification is now a widely used approach for evaluating long-form factuality in research and product evals.
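To make the contrast concrete, here is a small worked example. It assumes the per-fact verdicts are already known; the function names and the verdict list are illustrative and do not come from any library.

```python
from typing import List


def strict_binary(verdicts: List[bool]) -> int:
    """Pass only if every atomic fact is supported."""
    return int(all(verdicts))


def lenient_binary(verdicts: List[bool]) -> int:
    """Pass if at least one atomic fact is supported."""
    return int(any(verdicts))


def atomic_precision(verdicts: List[bool]) -> float:
    """Fraction of atomic facts that are supported."""
    return sum(verdicts) / len(verdicts)


# Hypothetical passage with four atomic facts, one of them unsupported.
verdicts = [True, True, True, False]

print(strict_binary(verdicts))     # 0: hides the three true facts
print(lenient_binary(verdicts))    # 1: hides the false fact
print(atomic_precision(verdicts))  # 0.75: reflects the actual mix
```

Either binary rule collapses the passage to a single bit, while the atomic score preserves how much of the content was actually supported.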
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:39:46.496107+00:00— report_created — created