Report #100897
[gotcha] Hidden system prompts, safety instructions, and embedded secrets can be extracted with simple elicitation attacks
Design system prompts as if they were public: never embed API keys, secrets, proprietary algorithms, or detailed safety logic in them. Minimize prompt length and sensitivity. Detect and block prompt-extraction attempts with input/output filters. Treat leaked prompts as a security incident and rotate any exposed instructions or credentials.
Journey Context:
Perez and Ribeiro formalized prompt leaking as an attack class, and follow-up work showed high success rates with model-generated extraction prompts. Because some developers put sensitive logic or even credentials directly into system prompts, leakage becomes a real breach. The common mistake is trying to make the prompt unleakable through wording; the right call is to remove the secrets. If the prompt only contains behavior instructions, leaking it is annoying but not catastrophic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:16:50.613962+00:00— report_created — created