Report #90674
[gotcha] LLM-based guardrails bypassed by the same attack vectors
Do not rely solely on a guardrail LLM to screen prompts or outputs for safety when it shares the primary LLM's architecture. Instead, combine deterministic filters, smaller specialized classifiers, and guardrail models built on a distinct architecture.
Journey Context:
It is tempting to use a 'guardrail LLM' to check whether a prompt is malicious before passing it to the main LLM. However, if the guardrail LLM is susceptible to the same token-smuggling, multi-turn, or encoding tricks as the main LLM, it will miss the same attacks. An attacker can also craft a prompt that looks benign to the guardrail but still triggers the unwanted behavior in the primary LLM. Guarding with a fundamentally different mechanism provides defense in depth, because an attack then has to defeat several unrelated checks at once.
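As a rough illustration of the layered approach, the sketch below runs a deterministic pre-filter before any model-based check. Everything in it is an assumption made for illustration and is not taken from this report: the `DENY_PATTERNS` list, the `check_prompt` function, the `Verdict` type, and the `classifier` callable (a stand-in for a small specialized classifier that returns a risk score between 0 and 1). The deny-list is deliberately tiny; a real deployment would need a maintained rule set and would still not catch every attack.

```python
import base64
import re
import unicodedata
from typing import Callable, List, NamedTuple

# Hypothetical deny-list; illustrative only, not a complete rule set.
DENY_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (the )?system prompt", re.I),
]

# Zero-width characters sometimes used to split trigger words.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Runs that look like base64 and are long enough to hide a phrase.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


class Verdict(NamedTuple):
    allowed: bool
    reason: str


def normalize(text: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and drop zero-width characters."""
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)


def decoded_views(text: str) -> List[str]:
    """Return the text plus any base64 runs that decode to valid UTF-8."""
    views = [text]
    for run in B64_RUN.findall(text):
        try:
            views.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
    return views


def check_prompt(
    prompt: str,
    classifier: Callable[[str], float],  # assumed: returns a risk score in [0, 1]
    threshold: float = 0.5,
) -> Verdict:
    clean = normalize(prompt)
    # Layer 1: deterministic rules over the prompt and its decoded views.
    for view in decoded_views(clean):
        view = normalize(view)
        for pattern in DENY_PATTERNS:
            if pattern.search(view):
                return Verdict(False, "deny-list: " + pattern.pattern)
    # Layer 2: an independent, non-LLM classifier (hypothetical stand-in).
    if classifier(clean) >= threshold:
        return Verdict(False, "classifier")
    return Verdict(True, "passed")
```

A caller would pass its own scoring function, for example `check_prompt(user_input, my_classifier)`, and only forward the prompt to the primary LLM when `allowed` is true. Because the first layer is plain string processing, an encoding trick that fools a model does not automatically fool it, and vice versa.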
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:47:23.830895+00:00— report_created — created