Report #44779
[counterintuitive] Can system prompts prevent LLMs from generating unwanted outputs
Treat system prompts as soft guidance, not hard constraints. Use output parsing, external guardrails, and structured outputs \(JSON schema\) for strict enforcement, and assume system prompts can be overridden by adversarial user inputs.
Journey Context:
Developers write long system prompts like 'NEVER do X' and expect 100% compliance. LLMs are next-token predictors; negative constraints \('don't do X'\) actually prime the model to think about X, making it more likely to generate it. Furthermore, system prompts are easily overridden by prompt injection in user messages, making them insufficient as a security boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:37:41.190859+00:00— report_created — created