Report #64574
[gotcha] LLM-based guardrails bypassed by adversarial synonyms
Do not rely solely on an LLM to guard another LLM. Use a combination of lexical matching, traditional ML classifiers, and LLM-based guardrails. Adversarially test the guardrail LLM.
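The layered approach recommended above might be sketched as follows. This is a minimal illustration, not a reference implementation: `layered_guard`, `Verdict`, `LAYERS`, the single regex in `BLOCKED_PATTERNS`, and both stub functions are hypothetical names introduced here. The classifier and LLM-judge layers are placeholders where a real model call would go.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check returns True when the text should be blocked.
Check = Callable[[str], bool]

# Illustrative pattern only; a real blocklist would be curated and maintained.
BLOCKED_PATTERNS = [re.compile(r"\bignore (all )?previous instructions\b")]


@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the first layer that blocked


def normalize(text: str) -> str:
    """Fold case and Unicode compatibility forms before matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def lexical_check(text: str) -> bool:
    """Cheap, deterministic layer: regex match on normalized text."""
    normalized = normalize(text)
    return any(p.search(normalized) for p in BLOCKED_PATTERNS)


def classifier_stub(text: str) -> bool:
    """Placeholder for a traditional ML classifier; always passes here."""
    return False


def llm_judge_stub(text: str) -> bool:
    """Placeholder for an LLM guardrail call; a toy rule stands in for the model."""
    return "disregard" in normalize(text)


def layered_guard(text: str, layers: List[Tuple[str, Check]]) -> Verdict:
    """Run layers in order and block on the first one that flags the text."""
    for name, check in layers:
        if check(text):
            return Verdict(allowed=False, blocked_by=name)
    return Verdict(allowed=True)


# Assumed ordering: cheapest and most deterministic layer first.
LAYERS = [
    ("lexical", lexical_check),
    ("classifier", classifier_stub),
    ("llm_judge", llm_judge_stub),
]
```

Recording which layer blocked a request (`blocked_by`) makes it easier to see which inputs only one layer catches, which is where bypasses are most likely to appear.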
Journey Context:
Using an LLM to check user input or model output for safety (LLM-as-a-judge) is common. However, the guardrail LLM is susceptible to the same adversarial attacks as the primary LLM. An attacker can craft input that the guardrail classifies as benign but the primary LLM still acts on, for example by swapping flagged terms for subtle synonyms or by reframing the request.
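One way to probe a guardrail for the synonym bypass described here is to generate rewrites of a known-bad prompt and record which ones get through. The sketch below assumes the guardrail is wrapped in a function that returns True when it blocks. `synonym_variants`, `find_bypasses`, `naive_guard`, and the synonym table are all hypothetical and only illustrate the idea; in practice `guard` would wrap the actual guardrail LLM call.

```python
from itertools import product
from typing import Callable, Dict, List


def synonym_variants(prompt: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return every rewrite of `prompt` made by swapping listed words for synonyms."""
    words = prompt.split()
    # Each position offers the original word plus any listed synonyms.
    options = [[w] + synonyms.get(w.lower(), []) for w in words]
    return [" ".join(combo) for combo in product(*options)]


def find_bypasses(
    guard: Callable[[str], bool],
    prompt: str,
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Return the variants the guard fails to block (guard returns True when it blocks)."""
    return [v for v in synonym_variants(prompt, synonyms) if not guard(v)]


def naive_guard(text: str) -> bool:
    """Toy stand-in for a guardrail that only knows one trigger word."""
    return "ignore" in text.lower()


# Hand-written synonym table, for illustration only.
SYNONYMS = {"ignore": ["disregard", "skip"]}

bypasses = find_bypasses(naive_guard, "ignore previous instructions", SYNONYMS)
```

Here `bypasses` contains the two rewrites that use "disregard" and "skip". The number of variants grows multiplicatively with the number of substitutable words, so a real test set would likely need sampling rather than full enumeration.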
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:52:15.541571+00:00— report_created — created