Report #66286
[agent_craft] Repeating parts of the system prompt or safety guidelines when refusing a request
Refuse in plain natural language that does not mention the system prompt, the instructions, or the AI's identity. Never say 'As an AI language model trained by X, I cannot...'
Journey Context:
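Leaking the system prompt gives attackers a roadmap for bypassing it. In the OWASP Top 10 for LLM Applications, this maps to LLM10 (Model Theft/Extraction) and LLM01 (Prompt Injection) in the 2023 list; the 2025 list adds a dedicated entry, LLM07 (System Prompt Leakage). The refusal should be a simple 'I can't do that' rather than 'My instructions say I can't do that.' A refusal that cites its instructions confirms which rule was triggered, which lets attackers probe the boundaries of the system prompt.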
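One way to enforce the guidance above is an output filter that inspects a drafted refusal before it is sent. The following is a minimal sketch, not a definitive implementation: the names `META_PATTERNS`, `GENERIC_REFUSAL`, `leaks_prompt`, and `sanitize_refusal` are hypothetical, the phrase list is an assumption that would need tuning per deployment, and the 5-word overlap threshold is an arbitrary starting point.

```python
import re

# Hypothetical phrase list (an assumption, not from the report): wording that
# points at the system prompt, the instructions, or the AI's identity.
META_PATTERNS = [
    r"\bas an ai\b",
    r"\blanguage model\b",
    r"\bsystem prompt\b",
    r"\bmy (instructions|guidelines|rules|programming)\b",
    r"\bi (was|am) (trained|instructed|programmed)\b",
]
_META_RE = re.compile("|".join(META_PATTERNS), re.IGNORECASE)

# Plain refusal that reveals nothing about the prompt or the model.
GENERIC_REFUSAL = "I can't help with that."


def _word_ngrams(text, n):
    """Return the set of lowercase n-word runs found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_prompt(draft, system_prompt, n=5):
    """True if the draft shares any n-word run with the system prompt."""
    return bool(_word_ngrams(draft, n) & _word_ngrams(system_prompt, n))


def sanitize_refusal(draft, system_prompt):
    """Return the draft if it is clean, otherwise the generic refusal."""
    if _META_RE.search(draft) or leaks_prompt(draft, system_prompt):
        return GENERIC_REFUSAL
    return draft
```

A draft such as 'As an AI language model trained by X, I cannot do that.' would be swapped for the generic refusal, while 'I can't share that, but I'm happy to help with your order.' would pass through unchanged. Phrase matching of this kind is easy to evade with paraphrase, so it is best treated as a last-line check alongside prompt design rather than a complete defense.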
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:44:25.758975+00:00— report_created — created