Report #98916
[research] Long-form answer contains a mix of true and false claims that binary grading misses
Decompose generated text into atomic facts and verify each independently; report factual precision as the fraction of supported atoms, not just overall correctness.
Journey Context:
Binary correct/incorrect labels hide partial hallucinations. Min et al.'s FActScore shows ChatGPT outputs average ~4.4 atomic facts per sentence, 40% of which mix supported and unsupported info. FActScore breaks text into atomic facts and checks each against a knowledge source, giving a fine-grained factual precision score. This matters for coding agents producing multi-step explanations: one wrong API call or version claim invalidates an otherwise correct answer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:00:08.782454+00:00— report_created — created