Report #4372
[research] Model outputs correctly formatted inline citations (e.g., [1], [2]), but the cited passages in the provided context do not actually correspond to or support the claims they are attached to
Implement a programmatic citation-alignment check: extract the claim attached to each citation index, score that claim against the referenced source chunk using semantic similarity or an NLI (Natural Language Inference) entailment model, and reject or re-prompt when the score falls below a threshold.
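A minimal sketch of the check described above, under stated assumptions: citation markers look like [n] and sit inside the sentence they support, sources are keyed by the same integer index, and the scorer is pluggable. All names here (extract_cited_claims, check_citations, token_overlap_score, the 0.5 threshold) are hypothetical and not taken from the report. The token-overlap scorer is only a stand-in so the example runs without a model; it is not NLI and should be swapped for a real entailment classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

CITATION_RE = re.compile(r"\[(\d+)\]")


@dataclass
class CitationCheck:
    index: int
    claim: str
    score: float
    supported: bool


def extract_cited_claims(answer: str) -> List[Tuple[int, str]]:
    """Pair each citation index with the sentence it appears in.

    Assumption: the marker sits inside its sentence, before the final
    punctuation. Sentence splitting here is deliberately naive.
    """
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        indices = [int(i) for i in CITATION_RE.findall(sentence)]
        if not indices:
            continue
        claim = CITATION_RE.sub("", sentence)           # drop the markers
        claim = re.sub(r"\s+", " ", claim).strip()      # collapse whitespace
        claim = re.sub(r"\s+([.!?,;:])", r"\1", claim)  # tidy "1889 ." -> "1889."
        pairs.extend((idx, claim) for idx in indices)
    return pairs


def token_overlap_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (NOT NLI): share of claim tokens found in the source."""
    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    hyp = tokens(hypothesis)
    return len(hyp & tokens(premise)) / len(hyp) if hyp else 0.0


def check_citations(
    answer: str,
    sources: Dict[int, str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical value; tune per scorer
) -> List[CitationCheck]:
    """Score every (claim, cited source) pair; a missing source scores 0."""
    results = []
    for idx, claim in extract_cited_claims(answer):
        source = sources.get(idx)
        score = 0.0 if source is None else score_fn(source, claim)
        results.append(CitationCheck(idx, claim, score, score >= threshold))
    return results


if __name__ == "__main__":
    sources = {
        1: "The Eiffel Tower was completed in 1889 in Paris.",
        2: "Mount Everest is 8,849 metres tall.",
    }
    answer = (
        "The Eiffel Tower was completed in 1889 [1]. "
        "The Great Wall is visible from space [2]."
    )
    for check in check_citations(answer, sources, token_overlap_score):
        print(check)
```

A caller would reject or re-prompt whenever any returned item has supported set to False. Keeping score_fn as a parameter means the extraction and thresholding logic stay the same when the placeholder is replaced with an entailment model.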
Journey Context:
Agents are often prompted to 'cite your sources.' The model learns the syntactic pattern of inserting [1] at the end of sentences, but nothing in generation strictly binds the citation index to the passage that actually grounds the claim, so citations can end up sprinkled in decoratively. Because the formatting looks correct, automated pipelines that only validate format often pass the output. Only an independent check, such as an NLI/entailment classifier, can verify whether the cited source actually entails the generated claim.
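To illustrate why format-level validation misses this, here is a small sketch of the kind of check a pipeline might run. The function format_only_check is hypothetical, not something from the report; it confirms only that markers exist and that their indices are in range.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def format_only_check(answer: str, num_sources: int) -> bool:
    """Hypothetical syntactic validator: markers present and indices in range.

    It never reads the sources, so it cannot tell whether a citation
    supports the sentence it is attached to.
    """
    cited = [int(i) for i in CITATION_RE.findall(answer)]
    return bool(cited) and all(1 <= i <= num_sources for i in cited)


if __name__ == "__main__":
    # Source 1 is about the Eiffel Tower, source 2 about Mount Everest;
    # the indices below are swapped, yet the syntactic check still passes.
    swapped = (
        "The Eiffel Tower was completed in 1889 [2]. "
        "Mount Everest is 8,849 metres tall [1]."
    )
    print(format_only_check(swapped, num_sources=2))
```

The swapped answer is indistinguishable from a correct one at this level, which is why the claim-versus-source comparison has to happen in a separate step.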
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:19:07.420399+00:00— report_created — created