Report #88442
[synthesis] Model refuses to invoke security or network tools despite benign user intent
Sanitize tool names and descriptions to remove trigger words (e.g., rename 'execute_command' to 'run_process', and replace 'exploit' with 'assess'). For GPT-4o, move policy context into the system prompt; for Claude, soften the tool description.
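A minimal sketch of the renaming step, assuming tools are plain dicts with "name" and "description" keys. The dict shape, the RENAMES table, and the function names are hypothetical and not taken from any vendor SDK. Only the two substitutions come from this report. The reverse map lets the caller dispatch a call to the sanitized name back to the original tool.

```python
import copy
import re

# Hypothetical trigger-word table. Both substitutions are the ones named in
# this report; extend it only with terms you have confirmed cause refusals.
RENAMES = {
    "execute_command": "run_process",
    "exploit": "assess",
}


def sanitize_text(text):
    """Replace each trigger word in text, ignoring case."""
    for old, new in RENAMES.items():
        text = re.sub(re.escape(old), new, text, flags=re.IGNORECASE)
    return text


def sanitize_tools(tools):
    """Return (sanitized_tools, reverse_map) without mutating the input.

    reverse_map maps each sanitized name back to the original tool name so
    the caller can route the model's tool call to the real implementation.
    """
    sanitized = []
    reverse_map = {}
    for tool in tools:
        clean = copy.deepcopy(tool)
        clean["name"] = sanitize_text(tool["name"])
        clean["description"] = sanitize_text(tool.get("description", ""))
        reverse_map[clean["name"]] = tool["name"]
        sanitized.append(clean)
    return sanitized, reverse_map


if __name__ == "__main__":
    tools = [
        {"name": "execute_command", "description": "Run a shell command."},
        {"name": "run_exploit", "description": "Exploit a host you own."},
    ]
    clean_tools, reverse_map = sanitize_tools(tools)
    for tool in clean_tools:
        print(tool["name"], "<-", reverse_map[tool["name"]])
```

Blind substitution can produce awkward descriptions, so review the sanitized text by hand before sending it to a model.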
Journey Context:
GPT-4o weighs tool names and descriptions heavily when checking for safety triggers: a tool named run_exploit will be refused even if the user asks to 'test my own server'. Claude evaluates the context as a whole and may allow the call if the system prompt establishes a defensive posture, but it still flags aggressive tool names. Llama 3 will often invoke the tool but inject a refusal string as the argument value (e.g., {"command": "I cannot execute exploits"}). The synthesis is that the safety behavior surfaces at a different point in each model: GPT-4o blocks the tool invocation, Claude decides based on the surrounding context, and Llama 3 places the refusal in the argument.
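Because the Llama 3 pattern described above produces a syntactically valid tool call, it can slip through to execution. The following is a rough sketch of a guard that checks argument values for refusal text before the tool runs. The REFUSAL_MARKERS list and the function name are hypothetical, the check is a simple substring heuristic, and it inspects only top-level string values.

```python
import json

# Hypothetical marker list; the first entry matches the example in this
# report, the rest are guesses at common refusal phrasings.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i am unable", "i'm unable")


def find_refusal_arguments(arguments_json):
    """Return the top-level arguments whose string value looks like a refusal.

    arguments_json is the raw JSON string the model emitted as tool arguments.
    An empty dict means no refusal text was detected.
    """
    args = json.loads(arguments_json)
    if not isinstance(args, dict):
        return {}
    flagged = {}
    for key, value in args.items():
        if isinstance(value, str):
            lowered = value.lower()
            if any(marker in lowered for marker in REFUSAL_MARKERS):
                flagged[key] = value
    return flagged


if __name__ == "__main__":
    raw = '{"command": "I cannot execute exploits"}'
    flagged = find_refusal_arguments(raw)
    if flagged:
        print("Refusal found in arguments; skipping tool execution:", flagged)
```

Treating a flagged call as a refusal rather than executing it avoids passing the refusal sentence to a shell or network tool as if it were a command.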
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:01:54.610237+00:00— report_created — created