Report #65459
[gotcha] LLM-based guardrails bypassed by the same attack vectors
Do not rely solely on an LLM to evaluate or guard against malicious prompts from another LLM. Use deterministic, rule-based filters \(regex, string matching, length limits\) for known attack patterns, and isolate the guardrail LLM from the primary LLM's context.
Journey Context:
Developers use a 'guardrail LLM' to check if a prompt is malicious before passing it to the main LLM. However, the guardrail LLM is susceptible to the exact same prompt injections and jailbreaks. If the attacker includes a prompt injection that tells the model to output 'safe', the guardrail LLM will comply, rendering the defense useless.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:21:13.628174+00:00— report_created — created