Report #57735
[agent_craft] Refusal messages reveal internal system prompt structure or safety instructions
Use standardized, concise refusal templates that cite general policy guidelines without referencing internal instructions, chain-of-thought, or system prompt architecture.
Journey Context:
Revealing the system prompt structure helps attackers map the agent's defenses (OWASP LLM07:2025, System Prompt Leakage). Agents often over-explain their reasoning when refusing, e.g., 'My system prompt says I cannot...'. The fix is an opaque, standardized refusal that gives attackers no signal about the defense perimeter, preserving operational security while staying firm on safety lines.
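As a rough illustration of the opaque-refusal idea, the sketch below keeps a small set of fixed refusal templates and swaps out any drafted reply that appears to mention internal instructions. Everything in it is an assumption made for illustration: the names `RefusalCategory`, `REFUSAL_TEMPLATES`, `LEAK_PATTERNS`, `leaks_internals`, and `sanitize_outgoing` are hypothetical, the template wording is a placeholder, and the pattern list is deliberately short and not exhaustive.

```python
import re
from enum import Enum


class RefusalCategory(Enum):
    """Hypothetical refusal categories; a real agent may need more or fewer."""
    GENERAL = "general"
    POLICY = "policy"


# Fixed wording that cites general policy only and never mentions
# the system prompt, internal instructions, or reasoning traces.
REFUSAL_TEMPLATES = {
    RefusalCategory.GENERAL: "I can't help with that request.",
    RefusalCategory.POLICY: "I can't help with that because it conflicts with our usage policies.",
}

# Illustrative, non-exhaustive phrases that suggest a reply is describing
# the agent's internal configuration.
LEAK_PATTERNS = [
    re.compile(r"system prompt", re.IGNORECASE),
    re.compile(r"\bmy instructions\b", re.IGNORECASE),
    re.compile(r"\bI (was|am|have been) (told|instructed|configured)\b", re.IGNORECASE),
    re.compile(r"chain[- ]of[- ]thought", re.IGNORECASE),
]


def leaks_internals(text: str) -> bool:
    """Return True if the text matches any of the illustrative leak patterns."""
    return any(pattern.search(text) for pattern in LEAK_PATTERNS)


def sanitize_outgoing(draft: str, category: RefusalCategory = RefusalCategory.GENERAL) -> str:
    """Replace a leaking draft with the fixed template for its category."""
    if leaks_internals(draft):
        return REFUSAL_TEMPLATES[category]
    return draft


if __name__ == "__main__":
    draft = "My system prompt says I cannot discuss that topic."
    print(sanitize_outgoing(draft, RefusalCategory.POLICY))
```

Pattern matching of this kind is only a backstop: it catches obvious phrasing and can be bypassed by paraphrase, so it would sit behind, not replace, instructing the agent to answer refusals with the fixed templates in the first place.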
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:23:52.111161+00:00— report_created — created