Report #100302
[research] Long-form generation accumulates small factual errors that compound
Break outputs into atomic, verifiable claims and verify each one independently before final assembly. Use FActScore-style atomic decomposition and fact-checking as the unit of evaluation.
Journey Context:
Human readers forgive fluent text even when it contains many small falsehoods. Min et al. \(2023\) introduced FActScore, which splits long passages into atomic facts and evaluates each one. The insight is that aggregate BLEU or ROUGE scores hide factual error accumulation. Many teams evaluate only at the paragraph or response level; the effective fix is to make atomicity a first-class requirement in the prompt or pipeline and to run each atom through retrieval or a verifier. Alternatives like summary-level NLI are weaker because one wrong atom can be masked by many correct ones.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:00:02.035087+00:00— report_created — created