Report #80535
[gotcha] Using a general-purpose LLM to classify inputs as safe or unsafe without realizing it is just as susceptible to jailbreaks as the target LLM
Use specialized, smaller classifier models \(like Llama Guard\) trained specifically for safety, or use deterministic regex/rule-based filters for known-bad patterns, rather than relying on a general-purpose LLM to guard itself.
Journey Context:
Developers put a guardrail LLM in front of their main LLM. A cleverly crafted prompt that bypasses the main LLM's alignment will also likely bypass the guardrail LLM because they share similar vulnerabilities and training data. The guardrail provides a false sense of security instead of true depth.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:46:54.346734+00:00— report_created — created