Report #57936
[gotcha] System prompt defenses failing against contextual ascendancy attacks
Do not rely solely on system prompts for security. Implement external guardrails (e.g., separate LLM classifiers, regex checks on output) to enforce safety, as any system prompt can be overridden by a sufficiently long or cleverly formatted user prompt.
Journey Context:
Developers often put all their safety rules in the 'system' message, assuming it has absolute priority. However, LLMs are trained to be helpful and tend to follow the most salient instructions in their context. An attacker can exploit this with techniques like 'context switching', or by supplying a massive, highly structured document that establishes a new set of rules and effectively drowns out the system prompt. Because the system prompt and the attacker's text share the same context window, security must be enforced outside that window.
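As a rough illustration of enforcing checks outside the context window, the sketch below wraps a model call with a regex deny-list and an optional second-pass classifier. It is a minimal sketch, not a vetted defense: `guarded_reply`, `regex_check`, `DENY_PATTERNS`, `FALLBACK`, and the `generate` and `classifier` callables are all hypothetical names introduced here, and the two patterns are placeholders for whatever a real deployment needs to block.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical deny-list; real deployments would define their own patterns.
DENY_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # SSN-like number
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),  # credential-like string
]

FALLBACK = "Sorry, I can't help with that."


@dataclass
class GuardrailResult:
    allowed: bool
    reason: Optional[str] = None


def regex_check(text: str) -> GuardrailResult:
    """Reject output that matches any deny pattern."""
    for pattern in DENY_PATTERNS:
        if pattern.search(text):
            return GuardrailResult(False, f"matched: {pattern.pattern}")
    return GuardrailResult(True)


def guarded_reply(
    user_prompt: str,
    generate: Callable[[str], str],
    classifier: Optional[Callable[[str], bool]] = None,
) -> str:
    """Call the model, then validate its output in ordinary code.

    `generate` stands in for the LLM call (assumption). `classifier` stands
    in for a separate moderation model that returns True when the text is
    acceptable (assumption). Neither check sees or depends on the system
    prompt, so overriding the system prompt does not disable them.
    """
    draft = generate(user_prompt)
    result = regex_check(draft)
    if result.allowed and classifier is not None and not classifier(draft):
        result = GuardrailResult(False, "classifier rejected output")
    return draft if result.allowed else FALLBACK
```

For example, `guarded_reply("hi", lambda p: "api_key = sk-123")` returns the fallback text regardless of what instructions the prompt contained, because the check runs on the output rather than inside the model. A regex deny-list only catches patterns someone anticipated, which is why pairing it with a separate classifier is suggested above.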
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:44:07.993333+00:00— report_created — created