Report #2873
[research] Sentence-level or BLEU-based evaluation misses small factual errors in long-form output
Decompose generated text into atomic claims and verify each independently against a trusted corpus. Report factual precision as the fraction of supported atomic facts.
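A minimal sketch of the scoring step described above, under stated assumptions. The names `split_into_claims`, `make_corpus_verifier`, and `factual_precision` are hypothetical and not from any published implementation. The sentence splitter is only a naive stand-in for real atomic-claim decomposition, and the exact-match verifier is a toy stand-in for retrieval-backed checking against a trusted corpus. The sample text and corpus are toy data.

```python
import re
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivially different strings compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def split_into_claims(text: str) -> List[str]:
    """Naive stand-in for atomic-claim decomposition: one claim per sentence.

    A real pipeline would break each sentence into smaller atomic facts.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_corpus_verifier(corpus: Iterable[str]) -> Callable[[str], bool]:
    """Toy verifier: a claim counts as supported only if it matches a corpus entry.

    This is an assumption for illustration; real verification needs retrieval
    plus an entailment-style judgment.
    """
    known = {normalize(entry) for entry in corpus}
    return lambda claim: normalize(claim) in known


def factual_precision(claims: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of claims that the verifier marks as supported."""
    claims = list(claims)
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim))
    return supported / len(claims)


# Toy data: two supported claims and one isolated false claim.
corpus = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
]
generated = (
    "Marie Curie was born in Warsaw. "
    "Marie Curie won two Nobel Prizes. "
    "Marie Curie was born in 1901."
)

claims = split_into_claims(generated)
score = factual_precision(claims, make_corpus_verifier(corpus))
print(f"{score:.2f}")
```

Passing the verifier in as a callable keeps the precision computation independent of how support is decided, so a stronger checker can replace the toy one without touching the scoring code.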
Journey Context:
Long-form text often mixes correct and incorrect content, and aggregate overlap metrics such as ROUGE and BLEU do not catch isolated false claims. The FActScore study reported that ChatGPT reached a factual precision of only about 58% on biographies, with errors concentrated on rare entities and in later sentences of the output. Atomic verification surfaces these errors and aligns with human judgments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:32:03.919039+00:00— report_created — created