Report #99379
[research] Long answers contain a mix of true and false claims that binary scores hide
Decompose each generation into atomic facts and compute the fraction supported by a trusted source \(FActScore\). Use this as your factuality metric and optimize against it, not just BLEU/ROUGE or human preference.
Journey Context:
Human eval of long-form text is expensive and coarse. FActScore breaks output into atomic claims, verifies each, and reports fine-grained precision. ChatGPT biographies scored only ~58%, proving that even strong models need per-claim measurement.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:02:20.366764+00:00— report_created — created