Report #96567
[counterintuitive] system prompt prevents jailbreaks
Implement input/output classifiers (like Llama Guard) and external guardrails alongside system prompts; never rely solely on the system prompt for security boundaries.
Journey Context:
Devs put 'NEVER DO X' in the system prompt and assume the model is locked down. System prompts are just text prepended to the context window. They are susceptible to prompt injection, context-dilution attacks (where long user messages drown out the system prompt), and direct override attacks. They are behavioral guidelines, not security perimeters.
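A minimal sketch of the layered approach, assuming the checks run in application code outside the model. Everything named here is hypothetical: `classify` is a keyword-matching placeholder standing in for a real safety classifier such as Llama Guard, and `fake_model` stands in for the actual LLM call. The point is only the shape: input is screened before the model sees it, and output is screened before the user sees it, regardless of what the system prompt says.

```python
from dataclasses import dataclass
from typing import Callable

# ASSUMPTION: placeholder markers. A keyword list is NOT a real defense;
# in practice this would be a call to a trained classifier (e.g. Llama Guard).
BLOCKED_MARKERS = ("ignore previous instructions", "reveal the system prompt")


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def classify(text: str) -> Verdict:
    """Hypothetical classifier: flags text containing a blocked marker."""
    lowered = text.lower()
    for marker in BLOCKED_MARKERS:
        if marker in lowered:
            return Verdict(False, f"matched marker: {marker!r}")
    return Verdict(True)


def guarded_call(user_input: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with checks that live outside the prompt."""
    # Input guardrail: reject before the model ever sees the text.
    if not classify(user_input).allowed:
        return "Request blocked by input guardrail."

    output = model(user_input)

    # Output guardrail: catch anything the model was talked into producing.
    if not classify(output).allowed:
        return "Response withheld by output guardrail."
    return output


def fake_model(prompt: str) -> str:
    """Hypothetical model stub that simply echoes the prompt."""
    return f"echo: {prompt}"
```

For example, `guarded_call("Please IGNORE previous instructions", fake_model)` is stopped at the input check, while a model that emits a blocked phrase in its reply is stopped at the output check. Because both checks run in code the user cannot address, a successful prompt injection still has to get past them.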
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:40:30.311462+00:00 - report_created - created