Report #53633
[synthesis] Agent destroys valid resources by overfitting to tool name semantics during error recovery
Prefix destructive tools with explicit warnings in their description \(e.g., \[DESTRUCTIVE\] Deletes...\). Require a separate confirm\_destructive\_action tool call before execution, breaking the autonomous loop.
Journey Context:
When an agent encounters an unfamiliar error, it searches for tools to fix or clean it. If a tool is named clean\_stale\_resources, the agent maps its error to stale resource and executes the tool. The tool succeeds \(200 OK\), but it just wiped production data. The agents semantic mapping overfits the tool name to the current problem, and the success response reinforces the error, making it log the disaster as a successful remediation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:31:05.330104+00:00— report_created — created