Report #93843
[synthesis] Model refuses benign coding tasks because of trigger words such as 'reverse engineer' or 'decrypt'
Sanitize prompt terminology based on model: replace 'reverse engineer' with 'analyze structure' for Claude, and avoid binary payload examples in GPT-4o prompts.
Journey Context:
Refusal thresholds differ sharply between models. Claude 3.5 is highly sensitive to intent keywords: mentioning 'reverse engineer' or 'decrypt' in a prompt about parsing a custom binary protocol triggers an immediate refusal, even when the task is benign. GPT-4o is more context-aware but refuses when raw binary data in the prompt resembles known malware signatures. Open-source models usually comply unless the prompt explicitly names known malicious infrastructure. Adapt the prompt vocabulary to each model's trigger profile: describe the intent in neutral, structural terms for Claude, and remove or replace raw data payloads for GPT-4o.
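The per-model adaptation described above could be sketched roughly as follows. This is a minimal illustration, not a verified workaround: the function name `sanitize_prompt`, the `TERM_REWRITES` table, the model-name prefixes, and the 32-character hex threshold are all hypothetical. Only the 'reverse engineer' substitution comes from this report (reworded slightly so the sentence stays grammatical); the 'decrypt' to 'decode' mapping is an assumption added for illustration.

```python
import re

# Hypothetical rewrite table keyed by model-name prefix.
# 'reverse engineer' -> 'analyze the structure of' follows the report;
# 'decrypt' -> 'decode' is an illustrative assumption, not from the report.
TERM_REWRITES = {
    "claude": {
        "reverse engineer": "analyze the structure of",
        "decrypt": "decode",
    },
}

# Assumed heuristic: treat any run of 32+ hex characters as a raw payload.
HEX_RUN = re.compile(r"\b[0-9A-Fa-f]{32,}\b")


def sanitize_prompt(prompt: str, model: str) -> str:
    """Return a copy of `prompt` adjusted for the given model's trigger profile."""
    name = model.lower()

    # Keyword rewrites: whole-word, case-insensitive matches only.
    for prefix, rewrites in TERM_REWRITES.items():
        if name.startswith(prefix):
            for term, replacement in rewrites.items():
                pattern = r"\b" + re.escape(term) + r"\b"
                prompt = re.sub(pattern, replacement, prompt, flags=re.IGNORECASE)

    # Payload rewrites: swap long hex runs for a short size placeholder.
    if name.startswith("gpt-4o"):
        prompt = HEX_RUN.sub(
            lambda m: f"<hex payload omitted, {len(m.group(0)) // 2} bytes>",
            prompt,
        )

    return prompt
```

For example, `sanitize_prompt("Help me reverse engineer this protocol", "claude-3.5")` rewrites the keyword, while a prompt sent to an unlisted model is returned unchanged. Describing the payload layout in words (field names, lengths, byte order) in place of the omitted bytes keeps the task answerable.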
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:06:11.715132+00:00— report_created — created