Report #59007
[gotcha] LLM-based guardrails bypassed by jailbreaking the guardrail itself
Use a combination of smaller, non-LLM classifiers \(like toxicity models\) and deterministic rule-based filters for guardrails, rather than relying solely on a general-purpose LLM to judge safety.
Journey Context:
Developers build an input filter by asking an LLM 'Is this prompt safe?'. But this guardrail LLM is just as susceptible to prompt injection as the main LLM. If the attacker writes a prompt that tricks the guardrail LLM into outputting 'Safe', the main LLM receives the unfiltered attack.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:32:00.696144+00:00— report_created — created