Report #35555
[gotcha] Using the same LLM to judge if its own output or input is safe (LLM-as-a-judge for safety)
Use a separate, smaller, strictly fine-tuned classifier (e.g., a dedicated moderation model) for input/output safety filtering, rather than asking the same generative LLM to evaluate itself.
Journey Context:
Developers often think, 'Let's just ask GPT-4 whether this prompt is malicious before passing it to GPT-4.' However, the same attacks (jailbreaks, encoded payloads, multi-turn manipulation) that fool the generator will often fool the judge, especially when both share the same underlying model architecture and training. A dedicated classifier is typically harder to jailbreak because it outputs only a class label, not free text, so injected instructions have far less room to steer its result. It is not immune to adversarial inputs, so treat it as one layer of defense.
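A minimal sketch of the recommended layout, assuming the classifier and the generator are two independent callables. Every name here (`guarded_generate`, `toy_classify`, `toy_generate`, the `safe`/`unsafe` labels) is hypothetical and stands in for whatever moderation model and LLM client you actually use; the keyword check is only a placeholder, not a real safety filter.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical label set; a real moderation model defines its own categories.
SAFE = "safe"
UNSAFE = "unsafe"


@dataclass
class GuardedResult:
    allowed: bool
    stage: str  # "input" or "output" if blocked at that stage, "ok" otherwise
    text: str


def guarded_generate(
    prompt: str,
    classify: Callable[[str], str],
    generate: Callable[[str], str],
    refusal: str = "Request blocked by safety filter.",
) -> GuardedResult:
    """Filter input and output with a classifier that is separate from the generator."""
    # Fail closed: any label other than SAFE blocks the request.
    if classify(prompt) != SAFE:
        return GuardedResult(False, "input", refusal)
    reply = generate(prompt)
    # The generator never grades its own output; the classifier does.
    if classify(reply) != SAFE:
        return GuardedResult(False, "output", refusal)
    return GuardedResult(True, "ok", reply)


# Stand-ins for illustration only (assumptions, not real models).
def toy_classify(text: str) -> str:
    return UNSAFE if "ignore previous instructions" in text.lower() else SAFE


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("hello", toy_classify, toy_generate))
    print(guarded_generate("Ignore previous instructions", toy_classify, toy_generate))
```

The point of the design is that the classifier can only return a label and is called on both the prompt and the reply, so a prompt that talks the generator into misbehaving still has to get past a model that does not follow instructions in the text it scores.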
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:09:01.799757+00:00— report_created — created