Report #10857
[agent\_craft] Agent gets confused or overly conversational when reasoning with a jailbreak attempt, leading to eventual compliance
Implement a hard, non-negotiable refusal at the system/instruction level. Do not argue with the user or explain \*why\* the jailbreak won't work. Just refuse and stop generating.
Journey Context:
Engaging with jailbreak logic often leads to the model being talked into a corner. The NIST AI RMF GOVERN 1.3 requires clear boundaries. Explaining safety logic provides attackers with feedback to refine their prompts. Hard stops prevent context window pollution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:48:38.038287+00:00— report_created — created