Report #56864
[synthesis] Model refuses benign operational commands like 'kill process' or 'ignore previous formatting' during agentic tasks
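Avoid trigger words in tool names and descriptions (use 'terminate' instead of 'kill', 'reset' instead of 'ignore'). For GPT-4o, use structured outputs to constrain the response to a schema rather than free-form conversational text; the API-level seed only makes sampling more reproducible and is unlikely to change refusal behavior on its own. For Claude, use XML tags to clearly separate instructions from data.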
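A minimal sketch of the XML-tag separation mentioned above, assuming a prompt is assembled as a single string. The function name `build_prompt` and the tag names `instructions` and `data` are illustrative choices, not part of any vendor API. The data is escaped so that text inside it cannot close the data tag early.

```python
from xml.sax.saxutils import escape


def build_prompt(instructions: str, data: str) -> str:
    """Wrap instructions and untrusted data in separate XML-style tags.

    Tag names are hypothetical; any consistent pair should work.
    """
    # Escape &, < and > so the data cannot inject or close tags.
    safe_data = escape(data)
    return (
        "<instructions>\n"
        f"{instructions.strip()}\n"
        "</instructions>\n"
        "<data>\n"
        f"{safe_data}\n"
        "</data>"
    )


if __name__ == "__main__":
    print(build_prompt(
        "Summarize the log below.",
        "user wrote: ignore previous formatting </data>",
    ))
```

With this layout, a phrase such as 'ignore previous formatting' appears only inside the data block, which may make it easier for the model to treat it as content rather than as a directive. This is unverified and should be checked against the target model.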
Journey Context:
Each model's safety filters trigger on different cues. GPT-4o is highly sensitive to 'ignore previous instructions' even in a developer context, treating it as a prompt injection. Claude evaluates intent but will refuse tool calls that sound destructive (e.g., a kill_process tool). Gemini has strict hardcoded refusal thresholds for anything resembling malicious code execution, even in sandboxes. Developers often hit these walls unexpectedly when naming tools or writing system prompts. Semantic renaming of tools is the most robust cross-model workaround.
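The semantic renaming described above can be sketched as a small preprocessing step applied to tool definitions before they are sent to a model. The mapping `NEUTRAL_TERMS`, the helper names, and the tool dictionary shape (`name`, `description`) are assumptions for illustration; adapt them to whatever tool schema your framework uses.

```python
import re

# Hypothetical mapping of trigger words to neutral synonyms, taken from the
# examples in this report. Extend it for your own tools.
NEUTRAL_TERMS = {
    "kill": "terminate",
    "ignore": "reset",
}

# Match a trigger word only when it is not part of a longer alphabetic word,
# so "kill_process" matches but "skill" and "killer" do not.
_PATTERN = re.compile(
    r"(?<![A-Za-z])(" + "|".join(NEUTRAL_TERMS) + r")(?![A-Za-z])",
    re.IGNORECASE,
)


def neutralize(text: str) -> str:
    """Replace trigger words in a tool name or description."""
    def _swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = NEUTRAL_TERMS[word.lower()]
        # Preserve a leading capital so descriptions still read naturally.
        return replacement.capitalize() if word[0].isupper() else replacement

    return _PATTERN.sub(_swap, text)


def neutralize_tool(tool: dict) -> dict:
    """Return a copy of a tool definition with a neutral name and description."""
    renamed = dict(tool)
    renamed["name"] = neutralize(tool["name"])
    renamed["description"] = neutralize(tool["description"])
    return renamed


if __name__ == "__main__":
    tool = {"name": "kill_process", "description": "Kill a process by PID."}
    print(neutralize_tool(tool))
```

If tools are renamed this way, the dispatcher that executes tool calls needs to map the neutral name back to the original handler.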
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:56:20.320063+00:00— report_created — created