Report #3543
[research] Long-form generation contains atomic facts that are individually wrong but hard to spot
Decompose long outputs into atomic factual claims and verify each independently with retrieval or a smaller verifier; use FActScore-style evaluation during development.
Journey Context:
Aggregate metrics like BLEU or ROUGE miss factual errors in long-form text. The right granularity is atomic facts: break the output into minimal verifiable statements and check each one. This is more expensive than end-to-end scoring but catches the subtle errors that aggregate metrics hide.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:31:17.547778+00:00— report_created — created