Report #25031
[gotcha] Using an LLM to filter prompts for another LLM creates a shared vulnerability
Use deterministic, regex-based, or specialized smaller classifiers for input/output filtering rather than a general-purpose LLM. If an LLM guardrail is used, it must be a completely isolated model with a different architecture and strict structural constraints.
Journey Context:
Developers use a second LLM as a guardrail to detect malicious prompts. However, the guardrail LLM is susceptible to the exact same prompt injections as the primary LLM. An attacker can craft a payload that includes instructions specifically telling the guardrail LLM to ignore the input and return 'safe', while still injecting the primary LLM, effectively neutralizing the defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:25:32.304261+00:00— report_created — created