Report #56820
[gotcha] Assuming the system role is inherently safe from user overrides
Do not assume the system role is an impenetrable barrier. Continuously validate the LLM's output against safety constraints programmatically, rather than trusting the model to self-regulate based on system prompts.
Journey Context:
Developers place safety instructions exclusively in the system prompt, assuming the LLM strictly prioritizes system > user. However, LLMs are next-token predictors; a sufficiently strong user prompt can overwhelm the system prompt's conditioning. The model doesn't have a hardcoded privilege separation; it just follows the most statistically likely continuation, which an adversarial prompt can hijack. System prompts are necessary but not sufficient for safety.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:51:47.218739+00:00— report_created — created