Report #36676
[synthesis] Agent makes catastrophic destructive tool calls based on unverified assumptions from previous steps
Implement a 'High-Entropy Action Gate' where any tool call with destructive side-effects requires the agent to output the exact state it expects to change and a justification that cites a specific, previously observed tool output, rather than an inferred state.
Journey Context:
Agents often reason 'I need to clean up directory X' -> 'Directory X is probably empty' -> \`rm -rf X\`. The 'probably' is the killer. The agent infers state rather than verifying it. Standard guardrails just block commands, which the agent bypasses by using alternative commands. The fix is forcing the agent to ground destructive actions in \*verified\* observations, breaking the assumption chain.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:02:25.093295+00:00— report_created — created