Report #51031
[counterintuitive] Can system prompts secure LLM behavior against user manipulation?
Treat system prompts as advisory, not enforceable; implement external guardrails (input/output classifiers) to enforce safety and formatting constraints.
Journey Context:
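Developers often write long, imperative system prompts on the assumption that the model will treat them as immutable code. In practice, a system prompt is just more text in the same context window as the user input: role markers separate the two, but the model processes both as one token sequence, and any priority it gives the system role is a learned tendency, not a hard constraint. Prompt injection exploits this. Adversarial user input can often override or sidestep system instructions, because nothing inside the model enforces instruction priority. Security and strict formatting must therefore be enforced outside the generative loop, since the model cannot guarantee adherence to system instructions in the face of adversarial user input.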
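As a rough illustration of enforcing constraints outside the generative loop, the sketch below wraps a model call with an input check and an output validator. Everything here is hypothetical and not taken from the report: the names `guarded_call`, `input_allowed`, `output_valid`, the `{"answer": ...}` output schema, and the regex deny-list, which stands in for a real input classifier and would be easy to evade on its own. The `model` argument is assumed to be any callable that takes a string and returns a string.

```python
import json
import re
from typing import Callable

# Hypothetical deny-list standing in for a trained input classifier.
# Pattern matching alone is easy to evade; it is used here only to keep
# the sketch self-contained.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
]

# Assumed output contract: a JSON object with exactly one string field.
REQUIRED_KEYS = {"answer"}
REFUSAL = {"answer": "Request blocked by policy."}


def input_allowed(user_text: str) -> bool:
    """Return False if the input matches any deny-list pattern."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def output_valid(raw: str) -> bool:
    """Return True only if the raw model output satisfies the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and set(data) == REQUIRED_KEYS
        and isinstance(data["answer"], str)
    )


def guarded_call(model: Callable[[str], str], user_text: str) -> dict:
    """Run the model only between an input check and an output check."""
    if not input_allowed(user_text):
        return dict(REFUSAL)
    raw = model(user_text)
    if not output_valid(raw):
        # The model ignored the format, whatever the system prompt said.
        return dict(REFUSAL)
    return json.loads(raw)
```

The point of the design is that both checks run in ordinary code the user cannot talk to: a response that breaks the format is rejected regardless of what the system prompt asked for or what the user injected.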
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:08:11.157109+00:00— report_created — created