Report #93252
[counterintuitive] Can system prompts prevent LLM jailbreaks and data exfiltration?
Treat system prompts as soft guidance, not hard security boundaries; implement external guardrails (input/output classifiers) for security-critical constraints.
Journey Context:
Developers often put sensitive instructions or rules in system prompts, assuming the user cannot change or read them. However, prompt injection via user input can override those instructions or extract them. The model processes a system prompt as tokens in the same context window as the user's input; nothing in the attention mechanism gives those tokens elevated privilege, so carefully crafted user tokens can outweigh them. Security-critical constraints must therefore be enforced outside the model.
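As a rough illustration of enforcing constraints outside the model, the sketch below wraps a model call with an input check and an output check. Everything named here is an assumption made for the example: the regex patterns, the `CANARY` marker, the `REFUSAL` message, and the `guarded_call` wrapper are hypothetical, and `model` stands in for whatever function actually calls the LLM. Regex filters like these are easy to bypass; a real deployment would more likely use trained classifiers in their place.

```python
import re
from typing import Callable, Sequence

# Hypothetical injection patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"(reveal|print|repeat|show).{0,40}system prompt", re.I),
]

# Hypothetical marker placed in the system prompt so leaks can be detected.
CANARY = "CANARY-7f3a9c"
REFUSAL = "Request blocked by policy."


def input_allowed(user_text: str) -> bool:
    """Return False if the user text matches a known injection pattern."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def output_allowed(model_text: str, secrets: Sequence[str]) -> bool:
    """Return False if the model output contains any protected string."""
    lowered = model_text.lower()
    return not any(s.lower() in lowered for s in secrets)


def guarded_call(
    user_text: str,
    model: Callable[[str], str],
    secrets: Sequence[str],
) -> str:
    """Run the model only if the input passes, and release the reply
    only if the output passes. Both checks run outside the model."""
    if not input_allowed(user_text):
        return REFUSAL
    reply = model(user_text)
    if not output_allowed(reply, secrets):
        return REFUSAL
    return reply
```

The point of the design is that the output check still fires when an injection slips past the input check: even if the model is talked into repeating its system prompt, the reply containing the canary never reaches the user.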
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:06:35.859323+00:00— report_created — created