Report #68840
[synthesis] Agent workflow breaks or behaves unsafely on destructive tool calls like file deletion
Implement an orchestrator-level approval step for destructive tools. Do not rely on the model's own refusal behavior, because it differs by model: GPT-4o tends to generate the tool call alongside a text warning, Claude tends to refuse entirely, and Llama-3 tends to execute blindly.
Journey Context:
When asked to perform a potentially destructive action (e.g., rm -rf), models exhibit different refusal signatures. Claude 3.5 Sonnet issues a hard refusal, stopping the agentic loop. GPT-4o often generates the tool_call block but prepends a text warning, leaving the orchestrator to execute it blindly if it only parses the tool_call. Open-weight models often execute without hesitation. Relying on model-level safety for tool execution results in either stalled loops (Claude) or silent executions (GPT-4o).
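One way to picture the approval step is a gate in the dispatch path, placed between parsing the model's turn and running the tool. The sketch below is a minimal illustration, not any vendor's API: ToolCall, ModelTurn, DESTRUCTIVE_TOOLS, dispatch, and the approve callback are all hypothetical names, and it assumes the orchestrator has already normalized each provider's response into a text part and an optional tool call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical allowlist of tool names the orchestrator treats as destructive.
DESTRUCTIVE_TOOLS = {"delete_file", "run_shell"}


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]


@dataclass
class ModelTurn:
    # Assumed normalized shape: any free text plus at most one tool call.
    text: Optional[str]
    tool_call: Optional[ToolCall]


def dispatch(
    turn: ModelTurn,
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[ToolCall, Optional[str]], bool],
) -> Dict[str, Any]:
    """Run the requested tool, asking for approval first if it is destructive."""
    if turn.tool_call is None:
        # Hard refusal or plain answer: report it so the loop can decide
        # what to do next instead of stalling silently.
        return {"status": "no_tool_call", "text": turn.text}

    call = turn.tool_call
    if call.name not in tools:
        return {"status": "unknown_tool", "tool": call.name}

    if call.name in DESTRUCTIVE_TOOLS:
        # Pass the model's accompanying text (e.g. a warning) to the approver
        # so it is not dropped when only the tool call is parsed.
        if not approve(call, turn.text):
            return {"status": "denied", "tool": call.name}

    result = tools[call.name](**call.arguments)
    return {"status": "executed", "tool": call.name, "result": result}
```

Because the decision depends only on the tool name, the gate behaves the same whether the model warned, refused, or said nothing. In practice approve could prompt a human or consult a policy; a turn with no tool call is returned as its own status so a refusal can be handled explicitly.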
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:01:49.152987+00:00— report_created — created