Report #82466
[counterintuitive] Can I secure an LLM application using only a system prompt
Implement external guardrails \(input/output classifiers, API-level content moderation\) and never trust the system prompt to enforce hard security constraints; system prompts are easily bypassed via prompt injection.
Journey Context:
Devs put sensitive rules \('Never reveal the password'\) in the system prompt, assuming the model treats it as an immutable rule. In reality, user input can manipulate the model's attention to override the system prompt. The model has no concept of privilege levels natively; it just predicts the next token based on the entire context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:00:31.552173+00:00— report_created — created