Report #3987
[research] A response can contain real-looking citations that do not actually support the claims next to them.
Evaluate citation precision (every cited passage must entail the claim it backs) and citation recall (every verifiable claim must have a citation), separately from fluency and answer correctness.
Journey Context:
Retrieval alone does not guarantee attribution; models frequently cite topically relevant passages that do not entail the generated sentence. ALCE introduced automatic citation F1 over NLI-based precision and recall and showed that even strong systems leave many claims unsupported. Measuring both dimensions prevents the common failure mode where a system looks well-cited while silently synthesizing unsupported content.
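The two measurements described above can be sketched as follows. This is a minimal illustration, not ALCE's exact protocol: precision is taken as the fraction of cited passages that entail their claim, and recall as the fraction of claims backed by at least one entailing passage. The names `Claim`, `citation_scores`, and `toy_entails` are hypothetical, and `toy_entails` is a word-overlap stand-in that should be replaced by a real NLI model in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical signature: (passage, claim) -> True if the passage entails the claim.
Entails = Callable[[str, str], bool]


@dataclass
class Claim:
    text: str
    cited_passages: List[str] = field(default_factory=list)


def citation_scores(claims: List[Claim], entails: Entails) -> Dict[str, float]:
    """Simplified citation precision, recall, and F1 (an assumption, not ALCE's exact metric)."""
    total_citations = 0
    supporting_citations = 0
    supported_claims = 0
    for claim in claims:
        hits = [entails(passage, claim.text) for passage in claim.cited_passages]
        total_citations += len(hits)
        supporting_citations += sum(hits)
        if any(hits):
            supported_claims += 1
    precision = supporting_citations / total_citations if total_citations else 0.0
    recall = supported_claims / len(claims) if claims else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def toy_entails(passage: str, claim: str) -> bool:
    # Toy stand-in, NOT a real NLI model: true only when every word of the
    # claim also appears in the passage.
    return set(claim.lower().split()) <= set(passage.lower().split())


claims = [
    # Cited passage entails the claim.
    Claim("paris is the capital of france",
          ["paris is the capital of france and its largest city"]),
    # Cited passage is topically relevant but does not entail the claim.
    Claim("the eiffel tower opened in 1889",
          ["the eiffel tower is in paris"]),
    # Verifiable claim with no citation at all.
    Claim("france uses the euro", []),
]

scores = citation_scores(claims, toy_entails)
print(scores)
```

In this toy input, one of the two citations is supporting and one of the three claims is supported, so precision and recall diverge. Reporting them separately is what exposes the second claim (cited but unsupported) and the third (uncited).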
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:37:25.735580+00:00— report_created — created