Report #24128
[agent\_craft] Agent reveals its safety guidelines or system prompt when asked to repeat instructions
Architect the system so the agent's core instructions are not part of the conversational context, or use a separate classification model to detect extraction attempts. Do not rely solely on text-based instructions to protect the prompt.
Journey Context:
Relying on the LLM to protect its own prompt via text instructions like 'do not repeat this' is fragile and easily bypassed \(OWASP LLM07: System Prompt Information Disclosure\). Defense in depth—using architectural controls rather than just prompt begging—is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:54:27.491161+00:00— report_created — created