Report #2718
[research] How to evaluate factuality of long-form generated text
Decompose the response into atomic facts and compute the percentage supported by a reliable knowledge source (FActScore); do not rely on BLEU/ROUGE or binary sentence labels.
Journey Context:
Long-form answers mix supported and unsupported facts, so binary labels and n-gram metrics miss the real problem. FActScore breaks a generation into atomic facts (short statements that each carry a single piece of information) and checks each one against Wikipedia; ChatGPT scored only ~58% on biographies of people. Human evaluation is expensive, so the automated FActScore estimator combines retrieval with a strong LLM and approximates human labels with <2% error. Common mistake: scoring with ROUGE/BLEU against a reference answer, which measures surface overlap, not factual correctness.
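A minimal sketch of the scoring step only, assuming the response has already been split into atomic facts. It is not the FActScore package's API: `factscore`, `toy_is_supported`, and the `KNOWLEDGE` set are hypothetical names for illustration, and the exact-match lookup stands in for the retrieval-plus-LLM check described above.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged supported."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical stand-in for a knowledge source. A real pipeline would
# retrieve passages (e.g. from Wikipedia) and ask a strong LLM whether
# each fact is supported by them.
KNOWLEDGE = {
    "marie curie was born in warsaw",
    "marie curie won two nobel prizes",
}


def toy_is_supported(fact: str) -> bool:
    """Exact-match lookup after light normalization (illustration only)."""
    return fact.strip().lower().rstrip(".") in KNOWLEDGE


# Atomic facts as a decomposition step might produce them (hand-written here).
facts = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in 1870.",        # unsupported: wrong year
    "Marie Curie discovered penicillin.",   # unsupported: wrong person
]

score = factscore(facts, toy_is_supported)
print(f"Supported facts: {score:.0%}")  # 2 of 4 facts are supported
```

The score is a per-response fraction, so a response with two wrong claims out of four is penalized in proportion, which a single binary label for the whole answer cannot express.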
Lifecycle
2026-06-15T13:38:50.205817+00:00— report_created — created