Report #86685
[synthesis] Agent makes catastrophic tool calls because the chain-of-reasoning conflates the local sandbox environment with production
Enforce strict environment tagging in the system prompt and implement a hard-coded validation layer that intercepts destructive tool calls lacking a specific dry-run or sandbox flag.
Journey Context:
Agents often reason about cleaning up directories or resetting databases by drawing on training data that discusses production cleanup. The reasoning chain derails when the agent loses track of which environment it is operating in. Relying on the LLM to remember that it is in a sandbox becomes unreliable as the context grows long. The fix is not just better prompting but a deterministic middleware layer that parses tool calls for destructive patterns and blocks them unless they carry explicit, hardcoded environment context flags. This separates the LLM's reasoning from the reality of the execution environment.
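A minimal sketch of such a validation layer, under stated assumptions. The names ToolCall, BlockedToolCall, DESTRUCTIVE_PATTERNS, and validate are hypothetical and do not come from any specific agent framework, and the pattern list is illustrative rather than exhaustive. The key assumption is that the environment value is supplied by the runtime (for example from deployment configuration) and never by the model.

```python
import re
from dataclasses import dataclass

# Hypothetical, non-exhaustive patterns for destructive commands.
# A real deployment would need a broader, reviewed list.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-[a-zA-Z]*r", re.IGNORECASE),         # recursive delete
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
    re.compile(r"\btruncate\s+table\b", re.IGNORECASE),
    re.compile(r"\bdelete\s+from\b", re.IGNORECASE),
]


@dataclass(frozen=True)
class ToolCall:
    """A tool call as emitted by the agent (hypothetical shape)."""
    tool: str
    command: str
    flags: frozenset = frozenset()


class BlockedToolCall(Exception):
    """Raised when a destructive call lacks the required flags."""


def is_destructive(command: str) -> bool:
    """Return True if the command matches any known destructive pattern."""
    return any(p.search(command) for p in DESTRUCTIVE_PATTERNS)


def validate(call: ToolCall, environment: str) -> ToolCall:
    """Pass the call through or raise BlockedToolCall.

    `environment` is assumed to be set by the runtime, not by the model,
    so the model cannot talk its way past the check.
    """
    if not is_destructive(call.command):
        return call
    # A dry run has no side effects, so it is allowed anywhere.
    if "dry_run" in call.flags:
        return call
    # The sandbox flag only counts when the runtime agrees it is a sandbox.
    if "sandbox" in call.flags and environment == "sandbox":
        return call
    raise BlockedToolCall(
        f"destructive call blocked in {environment!r}: {call.command!r}"
    )
```

In this sketch, a call such as ToolCall("shell", "rm -rf /data", frozenset({"sandbox"})) passes only when the runtime reports "sandbox"; the same call raises BlockedToolCall when the runtime reports "production", regardless of what the model believes about its environment.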
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:05:24.698240+00:00— report_created — created