Report #2726
[research] How to automatically evaluate long-form factuality at scale
Use SAFE (Search-Augmented Factuality Evaluator): decompose a response into self-contained atomic facts, have an LLM agent issue multi-step Google Search queries for each fact and judge whether the results support it, then aggregate the per-fact verdicts with F1@K.
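The F1@K aggregation can be sketched as below. This is a minimal illustration, not the reference implementation: it assumes precision is the share of supported facts among the rated facts, and that recall is the number of supported facts divided by a user-chosen target K, capped at 1. The function name `f1_at_k` and its parameters are hypothetical.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Combine per-fact verdicts for one response into a single F1@K score.

    Assumptions (illustrative): facts judged irrelevant to the prompt are
    excluded before counting, and k is the number of supported facts a
    reader is assumed to want from a response.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if supported < 0 or not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if supported == 0:
        # No supported facts: the score is defined as 0.
        return 0.0
    # Precision: share of rated facts that were supported.
    precision = supported / (supported + not_supported)
    # Recall: supported facts relative to the target k, capped at 1.
    recall = min(supported / k, 1.0)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because recall is capped at K, a response cannot raise its score by padding with extra facts once it already has K supported ones, while unsupported facts still lower precision.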
Journey Context:
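In the reported evaluation, SAFE agreed with crowdsourced human annotators on 72% of facts, and in a random sample of 100 disagreement cases SAFE was judged correct 76% of the time, at more than 20× lower cost. Because it retrieves evidence through live search rather than a fixed reference set, it can check facts that static references miss. The companion prompt set, LongFact, supplies 2,280 prompts across 38 topics. Two common mistakes are scoring a long answer with a single correctness label and relying on preset references that miss current facts. SAFE's fact-level granularity and search grounding make it practical for ongoing benchmarking of open-domain generation.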
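The fact-level, search-grounded loop described above might look roughly like the following sketch. It is not the SAFE codebase: `count_fact_verdicts` and the four callables it takes (`split_facts`, `next_query`, `search`, `judge`) are hypothetical placeholders for the LLM and search calls, and `max_steps` is an assumed cap on queries per fact.

```python
from typing import Callable, List, Tuple


def count_fact_verdicts(
    response: str,
    split_facts: Callable[[str], List[str]],
    next_query: Callable[[str, List[str]], str],
    search: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], bool],
    max_steps: int = 5,
) -> Tuple[int, int]:
    """Return (supported, not_supported) counts for one long-form response.

    All four callables are injected so the loop stays independent of any
    particular LLM or search provider (an assumption of this sketch).
    """
    supported = 0
    not_supported = 0
    for fact in split_facts(response):
        evidence: List[str] = []
        # Multi-step search: each query can depend on evidence found so far.
        for _ in range(max_steps):
            query = next_query(fact, evidence)
            evidence.extend(search(query))
        # One verdict per fact, instead of one label for the whole answer.
        if judge(fact, evidence):
            supported += 1
        else:
            not_supported += 1
    return supported, not_supported
```

The two counts it returns are the per-response inputs that an aggregate such as F1@K consumes, which is what keeps a long answer from being reduced to a single right-or-wrong label.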
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:39:51.337978+00:00— report_created — created