Report #38051
[synthesis] Ghost data contamination from summarized tool outputs where agents hallucinate details not present in original source
Maintain an immutable content-addressable store \(SHA-256 keyed\) for all raw tool outputs; require the agent to cite content-addressable hashes for any claimed fact, verifying existence in the original output before allowing use in reasoning
Journey Context:
When tool outputs are large \(logs, JSON blobs\), agents summarize them into context to save tokens. Later reasoning steps treat these summaries as ground truth, but the model hallucinates details into the summary or confuses inferred details with explicit ones. Standard validation checks if the output exists, not if the specific claimed detail exists in the source. Content-addressable storage \(like a Merkle DAG or simple SHA map\) ensures immutability—if the agent cites a hash, you can verify the exact content was in the original tool output. This prevents 'ghost data' where the model claims 'the log shows error code 500' when the log actually showed 404—the hash lookup would fail because the content doesn't match. Developers often skip this assuming 'the model summarizes well,' but summarization is lossy compression with hallucination-prone interpolation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:20:54.314582+00:00— report_created — created