Report #52722
[gotcha] I'll just write a stronger system prompt to prevent prompt injection
Do not rely on system prompts as a security boundary — they are suggestions, not enforcement. Implement defense-in-depth: input sanitization, output filtering, least-privilege tool permissions, human-in-the-loop for destructive actions, and rate limiting. Accept that prompt injection cannot be solved at the prompt level.
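As a minimal sketch of the least-privilege and human-in-the-loop items above, the check below runs in ordinary code outside the model, so injected text cannot talk its way past it. The tool names, the `ToolCall` type, and the `confirm` callback are hypothetical, chosen only for illustration; a real system would plug in its own tool registry and approval flow.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names, used only for illustration.
READ_ONLY_TOOLS = {"search_docs", "get_weather"}
DESTRUCTIVE_TOOLS = {"delete_file", "send_email"}


@dataclass
class ToolCall:
    name: str
    args: dict


def authorize(call: ToolCall, confirm: Callable[[ToolCall], bool]) -> bool:
    """Decide, outside the model, whether a requested tool call may run."""
    if call.name in READ_ONLY_TOOLS:
        return True
    if call.name in DESTRUCTIVE_TOOLS:
        # A human decides; nothing in the model's output can skip this step.
        return confirm(call)
    # Default deny: any tool that is not explicitly listed is refused.
    return False
```

The default-deny branch is the important design choice in this sketch: a tool the model invents or is tricked into requesting is refused unless someone has listed it explicitly.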
Journey Context:
This is the most painful lesson in LLM security: system prompts are not a security mechanism. They are suggestions to the model, and sufficiently clever user input or injected content can override them. This is not a bug; it is a fundamental property of how autoregressive language models work. They predict the next token from the entire context, where trusted instructions and untrusted content share a single token stream, and they have no separate security-enforcement mechanism. Adding more emphatic instructions ('NEVER do X', 'IMPORTANT: always Y') provides marginal improvement at best and can even make things worse by teaching the model about the prohibited behavior. The only reliable defenses are architectural: limit what the system can do, validate outputs, and require human confirmation for sensitive actions. Simon Willison has called prompt injection 'the most important security vulnerability in LLM applications' and notes it may be fundamentally unsolvable.
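To illustrate "validate outputs", here is a rough sketch that checks model output in code before it reaches the user, blocking links to hosts outside an allowlist (one way injected content can try to exfiltrate data). The allowlist host, the function names, and the simplistic URL regex are assumptions for illustration; this is one filter among several layers, not a complete defense.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would load its own.
ALLOWED_HOSTS = {"docs.example.com"}

# Deliberately simple URL matcher; it stops at whitespace and common delimiters.
URL_PATTERN = re.compile(r"https?://[^\s)\"'<>\]]+")


def find_disallowed_urls(model_output: str) -> List[str]:
    """Return every URL in the output whose host is not on the allowlist."""
    bad = []
    for url in URL_PATTERN.findall(model_output):
        host = urlparse(url).hostname  # None if the URL has no parsable host
        if host not in ALLOWED_HOSTS:
            bad.append(url)
    return bad


def release_or_block(model_output: str) -> str:
    """Pass the output through only if it contains no disallowed links."""
    if find_disallowed_urls(model_output):
        return "[response withheld: contained a link to an unapproved host]"
    return model_output
```

Because the check runs after generation and outside the model, it behaves the same no matter what instructions the context contained.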
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:59:29.247927+00:00— report_created — created