Report #67664
[gotcha] LLM-based guardrails easily bypassed by adversarial paraphrasing
Do not rely solely on an LLM to filter inputs and outputs for safety. Use a defense-in-depth approach that combines lexical filters (regex, keyword matching), smaller specialized classifiers, and LLM-based checks. LLM guardrails share the same underlying vulnerabilities as the target LLM, so treat them as one layer rather than the whole defense.
Journey Context:
Developers often use a 'guardrail LLM' (e.g., Llama-Guard) to check whether a prompt is malicious before passing it to the main LLM. However, if an attacker finds a paraphrase that slips past the guardrail LLM, it is likely to get past the main LLM's safety training too. Worse, the main LLM may understand the obfuscated intent that the guardrail missed. LLMs are not robust classifiers under adversarial pressure; for known bad patterns, traditional ML classifiers and lexical rules are more reliable.
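The layered approach described above can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: the deny-list patterns, the `SUSPICIOUS_TERMS` set, the toy scoring function, and the `llm_guard_layer` stub are all hypothetical placeholders. In a real system the classifier layer would be a trained model and the LLM layer would call whatever guardrail model is deployed.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    blocked: bool
    layer: Optional[str] = None  # name of the layer that blocked, if any


# Hypothetical deny-list of known bad phrasings (illustrative only).
DENY_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"\bsystem\s+prompt\b", re.IGNORECASE),
]

# Hypothetical vocabulary standing in for a small trained classifier.
SUSPICIOUS_TERMS = {"disregard", "reveal", "bypass", "override"}


def lexical_layer(text: str) -> bool:
    """Cheap first layer: exact known-bad patterns."""
    return any(p.search(text) for p in DENY_PATTERNS)


def classifier_layer(text: str, threshold: float = 0.5) -> bool:
    """Toy stand-in for a specialized classifier: scores by term hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in SUSPICIOUS_TERMS)
    score = min(hits / 2.0, 1.0)
    return score >= threshold


def llm_guard_layer(text: str) -> bool:
    """Stub for an LLM-based check; replace with a real guardrail call.

    Assumption: returns True when the guardrail model flags the text.
    This stub never flags anything.
    """
    return False


Layer = Tuple[str, Callable[[str], bool]]

# Ordered from cheapest to most expensive.
LAYERS: List[Layer] = [
    ("lexical", lexical_layer),
    ("classifier", classifier_layer),
    ("llm_guard", llm_guard_layer),
]


def run_guardrails(text: str, layers: List[Layer] = LAYERS) -> Verdict:
    """Block as soon as any layer flags the text."""
    for name, check in layers:
        if check(text):
            return Verdict(blocked=True, layer=name)
    return Verdict(blocked=False)
```

Because any single layer can block, a paraphrase has to evade every layer at once rather than just the LLM check. Recording which layer fired also helps show where known patterns are being caught and where the LLM layer is carrying the load alone.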
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:03:19.966412+00:00— report_created — created