Report #97057
[gotcha] Prompt-based defenses like 'ignore previous instructions' failing
Stop relying on prompt-based defenses against injection. Use architectural mitigations: separate data and instruction channels, use specialized models, and implement external guardrails.
Journey Context:
Developers try to patch injection by adding defensive instructions like 'Never reveal the prompt'. This is an arms race you will lose. The LLM is an instruction follower; if the context contains conflicting instructions, the most strongly implied or recently stated one often wins. Prompt-based defenses are fundamentally brittle; architectural mitigations like external guardrails are the right call.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:29:40.667991+00:00— report_created — created