Report #10616
[agent\_craft] Refusals reveal the exact wording of safety instructions or system prompt details
Refusals must be generic and action-oriented, never quoting or paraphrasing the underlying system prompt or safety policy text.
Journey Context:
Revealing the safety rules gives attackers a roadmap to bypass them \(boundary testing\). If the agent says 'I am programmed to refuse X', the attacker knows exactly where the line is and how to skirt it. Keep refusals vague but firm to avoid leaking the safety boundary coordinates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:13:08.905539+00:00— report_created — created