Report #98148
[counterintuitive] Prompt injection can be prevented by telling the model to ignore malicious instructions
Treat prompt injection as an architectural trust-boundary problem: separate privileged instructions from untrusted content, validate tool calls outside the model, and never rely on the model to enforce its own system prompt.
Journey Context:
Common belief: 'I can harden my system prompt with phrases like "ignore previous instructions" and "stay in character".' The OWASP Top 10 for LLM Applications lists prompt injection first (LLM01) because the model has no structural notion of a trusted system prompt versus untrusted user input. Both arrive as one token stream, and nothing in the architecture enforces a privilege boundary between them. Adding "ignore previous instructions" to the system prompt is therefore circular: it is just more text in the same stream, and the attacker can add the same phrase. Real defenses sit mostly outside the model: input/output filters, per-tool authorization, and provenance tags on retrieved content. Instruction-hierarchy training is a complementary model-side mitigation that lowers risk but is not a guarantee. No prompt engineering substitutes for these controls.
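A minimal sketch of per-tool authorization driven by provenance tags, enforced in application code rather than in the prompt. Everything named here is hypothetical and chosen for illustration: the Trust levels, the REQUIRED_TRUST table, the tool names, and the authorize function. It assumes the application records a provenance tag for every piece of content placed in the model's context, and it conservatively treats the whole context as only as trustworthy as its least trusted item.

```python
from enum import IntEnum
from typing import Iterable


class Trust(IntEnum):
    """Hypothetical provenance levels; a higher value means more trusted."""
    UNTRUSTED = 0  # retrieved documents, web pages, tool output
    USER = 1       # text typed by the authenticated end user
    SYSTEM = 2     # developer-authored instructions


# Hypothetical policy table: the minimum trust the context must have
# before each tool is allowed to run. Tool names are illustrative only.
REQUIRED_TRUST = {
    "search_docs": Trust.UNTRUSTED,  # read-only, low impact
    "send_email": Trust.USER,        # has side effects
}


def authorize(tool_name: str, context_trust: Iterable[Trust]) -> bool:
    """Decide, outside the model, whether a proposed tool call may run.

    context_trust holds the provenance tag of every item that was in the
    model's context when it proposed the call. The least trusted item
    sets the trust level of the whole context.
    """
    required = REQUIRED_TRUST.get(tool_name)
    if required is None:
        return False  # unknown tools are denied by default
    levels = list(context_trust)
    if not levels:
        return False  # no provenance information: deny
    return min(levels) >= required


# The user asked for an email and no retrieved content is in context.
print(authorize("send_email", [Trust.SYSTEM, Trust.USER]))
# A retrieved web page is in context, so a side-effecting call is refused
# regardless of what the page or the system prompt says.
print(authorize("send_email", [Trust.SYSTEM, Trust.USER, Trust.UNTRUSTED]))
```

The check never inspects the wording of the prompt or of the injected text, so an attacker cannot talk their way past it. The trade-off is that it is coarse: any untrusted content in context blocks side-effecting tools, so a real deployment would likely add a user-confirmation step instead of a flat denial.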
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:18:40.192294+00:00— report_created — created