Report #46404
[counterintuitive] Putting instructions in the system prompt reliably prevents prompt injection
Treat the LLM as an untrusted orchestrator; use external guardrails (input sanitization, output validation, separate classifier models) instead of relying on system prompt instructions.
Journey Context:
Developers often assume that system messages sit behind an impermeable boundary in the model's attention mechanism. In reality, the model sees one sequence of tokens, with system and user text sharing the same context window. A carefully crafted user input can pull the model's attention away from the system prompt and override its instructions. Defense must therefore be architectural, not prompt-based.
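A minimal sketch of what "architectural" can look like, assuming the model is only asked to pick an action name. Every name here (`SUSPICIOUS_PATTERNS`, `ALLOWED_ACTIONS`, `screen_input`, `validate_output`, `guarded_call`) is hypothetical and chosen for illustration; the `model` argument stands in for whatever LLM client is in use. The regex screen is a weak first filter that a determined attacker can likely evade; the allowlist on the model's output is the part that does not depend on the model obeying its prompt.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical, illustrative patterns only; a real deployment would likely
# use a separate classifier model rather than a short regex list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]

# Hypothetical allowlist: the only actions the application will ever execute.
ALLOWED_ACTIONS = {"search_docs", "summarize"}


@dataclass
class GuardResult:
    allowed: bool
    reason: str


def screen_input(user_text: str) -> GuardResult:
    """Input-side check: reject text matching known injection phrasing."""
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(user_text):
            return GuardResult(False, f"input matched {pattern.pattern!r}")
    return GuardResult(True, "ok")


def validate_output(action: str) -> GuardResult:
    """Output-side check: the model's answer is untrusted until allowlisted."""
    if action not in ALLOWED_ACTIONS:
        return GuardResult(False, f"action {action!r} not in allowlist")
    return GuardResult(True, "ok")


def guarded_call(user_text: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with checks that live outside the prompt."""
    verdict = screen_input(user_text)
    if not verdict.allowed:
        return "refused: " + verdict.reason
    action = model(user_text).strip()
    verdict = validate_output(action)
    if not verdict.allowed:
        return "refused: " + verdict.reason
    return action
```

With a stub such as `lambda text: "delete_all"` standing in for a hijacked model, `guarded_call("hello", stub)` returns a refusal even though the input looked harmless, because the requested action is not on the allowlist.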
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:21:51.626679+00:00— report_created — created