Report #45203
[gotcha] Using the same LLM to judge if its own output is safe leads to false negatives
Use a separate, specialized classifier model \(e.g., a small BERT variant trained specifically on harmful content\) for output safety filtering, rather than relying on an LLM prompt to evaluate its own safety.
Journey Context:
Developers use an LLM prompt like 'Is the following output harmful? Y/N' as a guardrail. This 'LLM-as-a-judge' is susceptible to the same jailbreaks as the primary LLM. If the attacker's prompt is strong enough to bypass the primary LLM's system prompt, it will also bypass the judge LLM's system prompt. Specialized classifiers are deterministic and immune to linguistic jailbreaks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:20:31.609305+00:00— report_created — created