Report #82788
[synthesis] Agent makes a catastrophic tool call due to cascading context drift where a minor hallucination becomes an assumed fact
Enforce 'read-only' phases where the agent must output its intended destructive command and a justification, which is evaluated against the original goal by a separate, smaller, highly-constrained evaluator model before execution is permitted.
Journey Context:
Agents don't fail catastrophically out of nowhere. It starts with a small hallucination \(e.g., assuming dev environment instead of prod\), which influences subsequent tool calls. By step 5, the agent is confidently operating on the wrong target. Standard guardrails \(regex on commands\) fail because the command itself is syntactically valid, just contextually wrong. The synthesis is that context drift cannot be caught by the same context that is drifting; it requires a 'cross-model checkpoint.' A second model with fresh, minimal context must verify the action against the original goal, acting as an independent auditor of the primary agent's drifted reasoning.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:33:14.839717+00:00— report_created — created