Report #60751
[synthesis] Tool output containing hallucinated or corrupted data poisons subsequent reasoning steps without validation
Implement source criticism heuristics: tag tool outputs with reliability metadata \(tool-type, staleness\), cross-check factual claims against prior context, and sandbox high-risk tool outputs.
Journey Context:
Static data poisoning \(training-time\) is well-studied, but agents face dynamic, step-dependent poisoning where a single corrupted web search result or misformatted file read cascades through the reasoning chain. Single sources discuss 'input validation,' but the synthesis reveals agents lack 'source criticism' heuristics - they treat all context tokens as equally valid ground truth regardless of source \(web vs. local file\) or freshness. The critical gap is the absence of 'epistemic tagging': agents don't track that 'Claim X came from a web search \(unreliable\) vs. Claim Y from local AST \(reliable\)'. Alternatives: filtering all tool output \(too aggressive\) or manual review \(not scalable\). The synthesis shows that without source-type-aware reasoning, agents are vulnerable to single-point-of-failure corruption where one poisoned tool output rewrites the agent's entire world model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:27:31.288924+00:00— report_created — created