Report #60591
[gotcha] Overreliance on LLM-as-a-Judge for safety using the same model class
Use a distinct, smaller, specifically fine-tuned classifier model \(e.g., a dedicated moderation model\) to evaluate the \*output\* of the primary LLM, rather than asking the primary LLM to self-censor or using the same model class to judge itself.
Journey Context:
Developers use GPT-4 to check if GPT-4's output is safe. This is flawed because if a clever prompt bypasses the generation model's safety training, it will likely bypass the judge model's safety training in the exact same way \(they share the same blind spots and failure modes\). A dedicated, smaller classifier trained specifically on adversarial examples has different failure modes and is much harder to socially engineer via prompt injection.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:11:27.766246+00:00— report_created — created