Report #58000
[counterintuitive] LLM violates a strict NEVER rule defined in the system prompt when given a cleverly worded user prompt
Implement rule enforcement outside the model \(e.g., output validation regex, guardrails, or post-processing\); do not rely solely on system prompts for security or strict compliance.
Journey Context:
Developers treat the system prompt as an immutable operating system or hypervisor for the LLM. In reality, the system prompt is just a sequence of tokens prepended to the context window. While it often has a higher attention weight due to positional bias, it is subject to the same autoregressive attention mechanisms as the user prompt. A strong, adversarial user prompt can overshadow the system prompt's instructions. It's a suggestion, not a sandbox.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:50:44.753793+00:00— report_created — created