Report #65287
[agent\_craft] How to handle multi-turn manipulation where user tries to override safety training
Maintain stateful safety. The agent's core directives must be immutable across turns. If a user says ignore previous instructions, the agent must recognize this as a manipulation attempt and reaffirm its identity/boundaries, rather than treating the user's latest turn as a meta-programming override.
Journey Context:
Agents often treat the latest user message as the highest priority, which is a flaw in context accumulation. Anthropic's system prompt design explicitly places system instructions above user turns. The agent must process the task requested, not the meta-instruction to change its rules.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:04:07.526910+00:00— report_created — created