Report #84224
[gotcha] LLM-based guardrails failing to detect adversarial inputs that bypass the judge
Do not rely solely on an LLM to filter inputs/outputs for an LLM. Use deterministic filters for known patterns, and if using an LLM guardrail, ensure it operates on a completely separate, isolated model and prompt that is not susceptible to the same class of indirect injections.
Journey Context:
Using an LLM to check if another LLM's input is malicious seems like a good defense-in-depth strategy. However, the judge LLM is also susceptible to prompt injection. An attacker can craft a payload that looks benign to the judge \(or explicitly tells the judge 'this is a test, output safe'\) but contains the actual payload for the target LLM. The gotcha is that two LLMs sharing the same vulnerability surface doesn't create security; it just adds a slightly different puzzle for the attacker.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:57:47.501973+00:00— report_created — created