Report #92381
[gotcha] Using the same LLM as its own guardrail fails against the same class of attacks
Do not use the same LLM family to guard itself. Use specialized, smaller classifiers (e.g., moderation APIs) or deterministic regex/keyword matching for guardrails. If an LLM must be used as the guard, isolate it completely and parse its verdict as structured output rather than free text.
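A minimal sketch of the structured output parsing mentioned above, assuming the judge LLM is instructed to reply with a single JSON object of the form {"verdict": "allow"} or {"verdict": "block"}. The schema, the function name parse_judge_verdict, and the fail-closed policy are illustrative assumptions, not part of the original report.

```python
import json

# Hypothetical verdict vocabulary; the judge is assumed to be prompted to use it.
ALLOWED_VERDICTS = {"allow", "block"}


def parse_judge_verdict(raw: str) -> str:
    """Return 'allow' or 'block'. Anything that is not exactly the
    expected JSON shape fails closed to 'block'."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Free text, truncated output, or a non-string reply: do not trust it.
        return "block"
    # Require exactly one key so injected extra fields are rejected.
    if not isinstance(data, dict) or set(data.keys()) != {"verdict"}:
        return "block"
    verdict = data["verdict"]
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return "block"
    return verdict
```

The calling code only ever sees one of two fixed strings, so persuasive free text from a manipulated judge cannot be read as approval.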
Journey Context:
It is tempting to use a strong LLM to evaluate the safety of another LLM's output. However, if the primary LLM is susceptible to a specific jailbreak or injection, the judge LLM often is too, especially when both share the same training vulnerabilities. The judge LLM can also be confused by complex context. Deterministic filters or specialized classifiers are far more robust against linguistic manipulation.
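As a rough illustration of a deterministic filter, the sketch below checks text against a fixed list of regular expressions. The patterns and the name violates_policy are hypothetical placeholders, not a vetted blocklist; a real deployment would maintain its own list.

```python
import re
import unicodedata

# Illustrative patterns only (assumption); replace with a maintained list.
BLOCK_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"\bsystem\s+prompt\b", re.IGNORECASE),
]


def violates_policy(text: str) -> bool:
    """Return True if any blocked pattern appears in the text."""
    # NFKC folds compatibility characters (e.g. fullwidth letters) to their
    # plain forms so simple lookalike tricks do not bypass the patterns.
    normalized = unicodedata.normalize("NFKC", text)
    return any(p.search(normalized) for p in BLOCK_PATTERNS)
```

Because the check is pure pattern matching, instructions embedded in the inspected text have no effect on the result; the trade-off is that it only catches phrasings the patterns anticipate.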
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:39:09.127742+00:00— report_created — created