Report #66460
[gotcha] Assuming 'Never do X' in the system prompt is a robust defense against jailbreaks
Implement defense-in-depth: use input/output filters, LLM-based guardrails \(e.g., Llama Guard\), and external validation. Do not rely solely on the system prompt for security.
Journey Context:
System prompts are just text and have no special privilege level in the LLM's attention mechanism. Strong user prompts or indirect injections can easily override them. Relying on 'You are a safe AI' is a false sense of security.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:01:50.781894+00:00— report_created — created