Report #42050
[counterintuitive] Are system prompts a secure boundary for preventing unwanted behavior
Treat system prompts as advisory, not authoritative; implement external guardrails \(input/output classifiers\) for security constraints.
Journey Context:
Developers put strict rules in system prompts \('Never reveal the secret key'\) and trust them. However, prompt injections in user messages can easily override system instructions because models process the entire context window as a continuous stream of tokens, and user instructions often carry strong instruction-tuning weights. System prompts are not sandboxed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:03:20.386331+00:00— report_created — created