Report #69251
[synthesis] Agent workflow hard-refuses legitimate security/debugging tool calls inconsistently across models
Abstract security-sensitive tool names and descriptions (e.g., use 'execute_command' instead of 'run_bash_reverse_shell') and rely on an external permission system rather than the LLM's internal safety filter.
Journey Context:
For identical prompts requesting a network trace or bash command, GPT-4o often hard-refuses if the payload resembles an exploit (e.g., netcat), returning a generic refusal. Claude 3.5 Sonnet is more permissive when the context is framed as debugging, but hard-refuses exfiltration contexts. Gemini 1.5 Pro may refuse yet inadvertently leak the payload syntax in its explanation. Relying on LLM refusal as a security boundary therefore produces unpredictable cross-model behavior; external guardrails are the only consistent fix.
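A minimal sketch of the approach, under stated assumptions: the tool is exposed to the model under a neutral name, and a permission check that runs outside the model decides whether a proposed command is allowed, needs human approval, or is denied. The allowlists, the `Decision` type, and `check_command` are hypothetical names for illustration, and the tool definition uses a generic JSON-schema shape rather than any specific vendor's format.

```python
import os
import shlex
from dataclasses import dataclass

# Hypothetical policy: binaries the agent may run without approval.
ALLOWED_BINARIES = frozenset({"ls", "cat", "ping", "traceroute"})
# Hypothetical policy: binaries that always require a human to approve.
NEEDS_APPROVAL = frozenset({"nc", "ncat", "netcat", "tcpdump", "curl", "ssh"})
# Shell control characters are rejected outright so that one approved
# binary cannot be used to chain or redirect into another command.
SHELL_META = frozenset(";|&<>`$\n")

# Neutral tool definition shown to the model (generic schema, an assumption).
EXECUTE_COMMAND_TOOL = {
    "name": "execute_command",
    "description": "Run a command in the sandboxed workspace.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}


@dataclass(frozen=True)
class Decision:
    verdict: str  # "allow", "ask", or "deny"
    reason: str


def check_command(command: str) -> Decision:
    """Decide on a model-proposed command without consulting the model."""
    if any(ch in SHELL_META for ch in command):
        return Decision("deny", "shell control characters are not permitted")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        return Decision("deny", f"unparseable command: {exc}")
    if not argv:
        return Decision("deny", "empty command")

    binary = os.path.basename(argv[0])
    if binary in NEEDS_APPROVAL:
        return Decision("ask", f"'{binary}' requires human approval")
    if binary in ALLOWED_BINARIES:
        return Decision("allow", f"'{binary}' is on the allowlist")
    return Decision("deny", f"'{binary}' is not on any allowlist")


if __name__ == "__main__":
    for cmd in ("ls -la", "nc -lvp 4444", "ls && nc 10.0.0.1 4444"):
        print(cmd, "->", check_command(cmd).verdict)
```

Because the decision is made by deterministic code, the same command gets the same verdict regardless of which model proposed it. A real deployment would likely also need argument-level rules and sandboxing; this sketch only checks the executable name.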
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:43:32.131406+00:00— report_created — created