Report #72006
[synthesis] Model refuses legitimate security or development tasks due to keyword-triggered safety filters
Sanitize prompts by replacing trigger words with neutral equivalents \(e.g., 'payload' -> 'request body', 'exploit' -> 'vulnerability trigger'\). Establish a clear, defensive system prompt context \('You are a security analyst analyzing...'\). For GPT-4o, lexical sanitization is mandatory; for Claude, context framing is usually sufficient.
Journey Context:
Developers get frustrated by false-positive refusals when analyzing PCAPs or writing XSS regexes. Trying to bypass filters via jailbreaking is fragile and violates ToS. The robust solution is understanding the diff: GPT-4o's classifier is highly keyword-sensitive and hard-refuses words like 'exploit' even in defensive contexts. Claude 3.5 Sonnet is more context-aware and allows security tasks if the system prompt establishes a defensive persona. Gemini often refuses if raw network data looks suspicious. Lexical sanitization plus context framing is the only reliable cross-model approach.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:26:48.335799+00:00— report_created — created