Report #43532
[gotcha] Believing that adding safety instructions to the system prompt is sufficient to prevent jailbreaks
Treat system prompts as a weak, first-line defense. Implement a defense-in-depth strategy: input filters, output filters, LLM-based guardrails (e.g., Llama Guard), and strict output schema validation.
Journey Context:
System prompts are just text. Strong adversarial prompts can often override them, especially ones that set up a fictional context (e.g., 'We are playing a game where...'). Relying solely on the system prompt creates a false sense of security. The LLM is a next-token predictor, not a rule-following engine: the system prompt and the user's input share the same context, and conflicting instructions are resolved by the model's learned weighting of that context, not by strict hierarchical enforcement.
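A minimal sketch of the layered approach, assuming the model and the guardrail classifier are passed in as plain callables. Every name here (`guarded_call`, `input_filter`, `validate_schema`, `output_filter`, the pattern list, and the `intent`/`text` schema) is hypothetical and chosen for illustration; the regex deny-list stands in for a real input filter, and `guard` stands in for an LLM-based guardrail such as Llama Guard, whose actual API is not shown.

```python
import json
import re
from typing import Callable, Optional, Sequence

# Illustrative deny-list only; real input filters are usually trained classifiers.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"we are playing a game where", re.I),
]
ALLOWED_INTENTS = {"answer", "refuse"}  # hypothetical output schema values
REFUSAL = {"intent": "refuse", "text": "Request blocked by policy."}


def input_filter(user_text: str) -> bool:
    """Layer 1: return True if no known injection pattern matches."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def validate_schema(raw_output: str) -> Optional[dict]:
    """Layer 3: accept only a JSON object with exactly the expected fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"intent", "text"}:
        return None
    if not isinstance(data["intent"], str) or not isinstance(data["text"], str):
        return None
    return data if data["intent"] in ALLOWED_INTENTS else None


def output_filter(text: str, secrets: Sequence[str]) -> bool:
    """Layer 4: return True if the output leaks none of the given strings."""
    return not any(s in text for s in secrets)


def guarded_call(
    user_text: str,
    model: Callable[[str], str],
    guard: Callable[[str], bool],
    secrets: Sequence[str] = (),
) -> dict:
    """Run the model only if every layer passes; otherwise return a refusal."""
    # Layers 1 and 2: cheap pattern check, then the guardrail model on the input.
    if not input_filter(user_text) or not guard(user_text):
        return dict(REFUSAL)
    # Layer 3: free-form or malformed output never reaches the caller.
    parsed = validate_schema(model(user_text))
    if parsed is None:
        return dict(REFUSAL)
    # Layer 4: leak check, then the guardrail model again on the output.
    if not output_filter(parsed["text"], secrets) or not guard(parsed["text"]):
        return dict(REFUSAL)
    return parsed
```

The system prompt still belongs inside `model`, but here it is only one layer: a jailbreak that talks the model out of its instructions still has to pass the schema check and the output filters, none of which the prompt text can rewrite.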
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:32:34.095457+00:00— report_created — created