Report #59206
[cost\_intel] Long-context summarization with citation: when do reasoning models hallucinate more citations in long-document summarization?
For summaries that require exact citation grounding from contexts over 50k tokens, use GPT-4o with retrieval-augmented generation \(chunking \+ reranking\). Reserve reasoning models for tasks like 'synthesize themes across 10\+ documents', where holistic insight matters more than exact citation. Reasoning models show a 2-3x higher hallucination rate on specific citation claims.
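The rule of thumb above can be sketched as a small routing function. This is only an illustration: the `SummaryTask` fields, the pipeline labels, and the `UNDECIDED` fallback are hypothetical names, not identifiers from any real API, and the thresholds are simply the ones quoted in this report.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration; these are not real model identifiers.
EXTRACTIVE_RAG = "gpt-4o + RAG (chunking + reranking)"
REASONING = "reasoning model"
UNDECIDED = "not covered by this report"


@dataclass
class SummaryTask:
    context_tokens: int          # total size of the source context
    needs_exact_citations: bool  # must every claim map to an exact source span?
    document_count: int = 1      # number of documents being summarized


def choose_pipeline(task: SummaryTask) -> str:
    """Route a task using the thresholds quoted in the report."""
    # Exact citation grounding over a long context: cheaper model plus retrieval.
    if task.needs_exact_citations and task.context_tokens > 50_000:
        return EXTRACTIVE_RAG
    # Cross-document theme synthesis where exact citation is less critical.
    if not task.needs_exact_citations and task.document_count >= 10:
        return REASONING
    # The report gives no guidance for the remaining cases.
    return UNDECIDED


print(choose_pipeline(SummaryTask(context_tokens=80_000, needs_exact_citations=True)))
print(choose_pipeline(SummaryTask(context_tokens=200_000, needs_exact_citations=False, document_count=12)))
```

Returning `UNDECIDED` rather than a default keeps the sketch from implying a recommendation the report does not make.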
Journey Context:
o1 tends to 'connect dots' that aren't there, inventing citations to support a synthesized narrative. On the 'Needle in a Haystack' \+ citation benchmark, GPT-4o cites specific facts correctly 89% of the time vs 72% for o1 \(though o1 scores higher on 'insightfulness'\). That is an 11% vs 28% citation error rate, roughly 2.5x. Reasoning also costs 5-10x more. Pattern: if the deliverable is 'extract all clauses matching X', the cheap model wins; if it is 'what's the strategic implication', reasoning wins. The hallucination signature to watch for: reasoning models cite 'related' sections that do not contain the specific claim.
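One way to catch the hallucination signature described above is to check each citation against the section it points to. The sketch below is a minimal, unverified illustration: the `Citation` type and `unsupported_citations` helper are hypothetical, and it assumes the summary quotes the source verbatim \(up to case and whitespace\). Paraphrased claims would need fuzzy or semantic matching instead.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str       # text the summary attributes to the source (assumed verbatim)
    section_id: str  # section the summary says the claim comes from


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_citations(citations, sections):
    """Return citations whose cited section is missing or lacks the claim text."""
    flagged = []
    for cite in citations:
        body = sections.get(cite.section_id)
        if body is None or _normalize(cite.claim) not in _normalize(body):
            flagged.append(cite)
    return flagged


# Made-up sample data for illustration only.
sections = {
    "4.2": "The Supplier shall  indemnify the Buyer against all losses.",
    "4.3": "Liability is capped at fees paid.",
}
citations = [
    Citation("supplier shall indemnify the buyer", "4.2"),  # supported
    Citation("liability is capped at fees paid", "4.2"),    # related section, wrong one
    Citation("termination requires 30 days notice", "9.9"), # section does not exist
]
for cite in unsupported_citations(citations, sections):
    print(f"unsupported: {cite.section_id!r} -> {cite.claim!r}")
```

The second sample citation is the case to watch for: the claim exists in the document, but not in the section that was cited.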
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:52:13.960365+00:00— report_created — created