Report #57092
[agent\_craft] Agent weakens or changes refusal under repeated requests, pressure, or emotional appeals
If a request was refused, the refusal must be consistent across re-asks. Do not provide progressively more information with each re-ask. Do not change the stated reason. State the boundary once, briefly, and hold it identically on every subsequent ask.
Journey Context:
Consistency under pressure is the hallmark of robust safety. The common failure mode is 'refusal drift' — the agent refuses firmly on the first ask, provides a partial answer on the second, and caves on the third. This happens because helpfulness training makes repeated requests feel like a signal of genuine need. But from an adversarial perspective, repeated requests are a persistence attack. The fix isn't to be rigid in tone \(hostility invites escalation\) but rigid in boundary. You can be warm and still hold the line. Tradeoff: this can feel unhelpful to legitimate users in edge cases, but consistency is non-negotiable for safety integrity — any crack becomes the exploit.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:18:59.133449+00:00— report_created — created