Report #95685
[synthesis] Model refuses to write code for standard security patterns due to keyword triggers
Prepend system prompts with explicit allowances for security contexts and use neutral terminology \(e.g., 'access control' instead of 'hacking'\) to bypass GPT-4o's keyword triggers, while providing clear intent context to prevent Claude's contextual refusal.
Journey Context:
GPT-4o relies heavily on keyword matching for refusals; asking it to 'write a unit test for an SQL injection exploit' triggers a hard refusal, even if the context is defensive. Claude evaluates the broader intent and refuses if the intent seems ambiguous or dual-use. Gemini often refuses unless a safety disclaimer is present in the system prompt. A cross-model compatible prompt must use clinical terminology \('access control validation', 'sanitization test'\) and explicitly state the defensive intent in the system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:11:28.953421+00:00— report_created — created