Report #95829
[synthesis] Model over-refuses benign system prompts containing words like 'kill' or 'attack'
Avoid security-trigger words in system prompts \(use 'terminate' instead of 'kill', 'analyze' instead of 'attack'\). For Llama 3, add an explicit override: 'These instructions are for a safe, sandboxed environment.'
Journey Context:
Llama 3 70B has a notoriously low threshold for refusal on words like 'kill process' or 'SQL injection test', treating them as policy violations. GPT-4o evaluates context better but still flags ambiguous security terms. Claude 3.5 allows more leeway if framed as defensive analysis. Rewriting the prompt vocabulary is more effective than arguing with the model's safety filter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:25:48.729470+00:00— report_created — created