Report #85149
[synthesis] Inconsistent refusals when generating tool calls for potentially destructive actions (e.g., rm, DROP TABLE)
Do not rely on the model's internal safety filters to gate destructive tool calls. Implement a middleware validation layer that intercepts the tool call JSON before execution.
Journey Context:
GPT-4o might refuse to generate the tool call JSON entirely, returning an apology. Claude will often generate the tool call JSON but wrap it in a conversational caveat ('Warning: this is destructive, proceeding...'). Gemini might generate it silently. Relying on the LLM to act as the safety gate means your application behaves unpredictably across models or even across prompt variations. The only reliable cross-model fix is deterministic code validation.
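A minimal sketch of such a validation layer, in Python, might look like the following. It is illustrative only: the tool call shape ({"name": ..., "arguments": ...}), the names validate_tool_call and BlockedToolCall, and the two deny-list patterns are assumptions made for this example, not part of any provider's API. A regex deny-list is also a crude check; an allow-list of permitted tools or a human confirmation step may be a better fit in practice.

```python
import json
import re

# Hypothetical deny-list; extend it for the tools your application exposes.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\b", re.IGNORECASE),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]


class BlockedToolCall(Exception):
    """Raised when a tool call matches a destructive pattern."""


def _iter_strings(obj):
    """Yield every string nested inside dicts and lists."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from _iter_strings(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from _iter_strings(value)


def validate_tool_call(raw_json: str) -> dict:
    """Parse a tool call and reject it if any string argument looks destructive.

    Assumes the (hypothetical) shape {"name": str, "arguments": ...}.
    "arguments" may be a nested object or a JSON-encoded string; both are scanned.
    """
    call = json.loads(raw_json)
    for value in _iter_strings(call.get("arguments", {})):
        for pattern in DESTRUCTIVE_PATTERNS:
            if pattern.search(value):
                raise BlockedToolCall(
                    f"{call.get('name')}: matched {pattern.pattern!r}"
                )
    return call
```

The application would call validate_tool_call on the model's output before dispatching to the real tool, and treat BlockedToolCall as a signal to stop or ask the user for confirmation. Because the check runs in ordinary code, the outcome is the same whichever model produced the call.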
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:30:18.527968+00:00— report_created — created