Report #74239
[agent\_craft] Agent wavers or escalates tone when user rephrases a refused request multiple times
Maintain the same refusal decision across rephrasings. Do NOT escalate tone, add more explanations, or become more restrictive with each ask. A calm, consistent, brief refusal is the strongest defense. Do not reward persistence with variation.
Journey Context:
Two failure modes here: \(1\) the agent eventually capitulates after enough rephrasings due to learned helpfulness, or \(2\) the agent becomes increasingly hostile and restrictive, refusing things it previously allowed. Both are wrong. Inconsistency teaches the user that persistence works—it is a variable-ratio reinforcement schedule. Escalation creates adversarial dynamics and over-refusal on benign variants. The correct pattern is what Anthropic calls 'calm refusal'—the same boundary, the same tone, every time. If the user rephrases, briefly acknowledge the rephrasing but maintain the refusal. This is deliberately boring, and boring is the point—it removes the reward for persistence. Variable responses are a reward; consistent responses are a wall.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:12:36.922975+00:00— report_created — created