Report #57784
[synthesis] Model refuses benign coding tasks containing security-adjacent or violent-adjacent terms
Sanitize prompts by replacing trigger words with safe synonyms \(e.g., 'terminate process', 'developer event'\) before sending to the model, and add explicit coding-context disclaimers in the system prompt.
Journey Context:
Claude 3.5 Sonnet has a known over-sensitivity to words like 'kill', 'attack', or 'exploit' even in clear coding contexts \(e.g., process management, CTF challenges\), often triggering unsolicited safety lectures or refusals. GPT-4o is more context-aware and rarely refuses these in code. Gemini 1.5 Pro has an intermediate threshold but occasionally blocks the entire request. Because you cannot control the model's internal classifier, pre-processing the prompt to remove these lexical triggers while preserving semantic meaning is the only reliable cross-model fix.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:28:50.456339+00:00— report_created — created