Report #85136
[gotcha] I use a different LLM as a safety judge, so adversarial attacks on the generator won't fool the evaluator
Do not rely solely on LLM-based safety evaluation. Prefer specialized safety classifiers over general-purpose LLMs for safety checks, and run deterministic output validation (regex, allowlists, schema validation) alongside any model-based evaluation. Test whether the adversarial suffixes that fool your generator also fool your judge; they likely will.
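A minimal sketch of the deterministic layer, assuming the generator is asked to return a JSON object with an `action` and a `text` field. The field names, the allowlist, the blocked patterns, and the length limit are hypothetical values chosen for illustration, not part of the original report. Because these checks involve no model call, an adversarial suffix cannot talk them into passing an output.

```python
import json
import re

# Hypothetical policy values, chosen only for illustration.
ALLOWED_ACTIONS = {"answer", "refuse", "escalate"}
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like number
    re.compile(r"(?i)\bignore (all )?previous instructions\b"),
]
MAX_TEXT_LEN = 2000


def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    violations = []
    # Schema check: exactly the expected keys.
    if set(data) != {"action", "text"}:
        violations.append("unexpected or missing keys")

    # Allowlist check: the action must be one of the known strings.
    action = data.get("action")
    if not (isinstance(action, str) and action in ALLOWED_ACTIONS):
        violations.append("action not in allowlist")

    text = data.get("text")
    if not isinstance(text, str):
        violations.append("text is not a string")
        return violations
    if len(text) > MAX_TEXT_LEN:
        violations.append("text too long")

    # Pattern check: reject text matching any blocked regex.
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            violations.append(f"blocked pattern: {pattern.pattern}")
    return violations
```

One way to use it is to run `validate_output` before any model-based judge and reject the output as soon as the returned list is non-empty, so the judge's verdict can only add restrictions and never override a deterministic failure.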
Journey Context:
The 'LLM-as-judge' pattern uses Model B to evaluate Model A's output for safety, which looks like defense-in-depth. However, adversarial suffixes discovered against one model often transfer to other models, including models from different families and providers. The same optimized token sequence that makes one model produce harmful output can also make another model judge that output as safe. The usual explanation is that the models share much of their training data distribution and have similar alignment training objectives, so a suffix exploits weaknesses they have in common. The counter-intuitive result is that adding a second LLM as a judge provides much less security than expected, because the two models can share failure modes. Deterministic, non-LLM guardrails (pattern matching, allowlists, schema enforcement) are essential complements.
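The transfer claim can be checked empirically before trusting a judge. The sketch below assumes the judge is wrapped in a function that takes the text under review and returns True when it considers the text safe, and that the suffix reaches the judge as part of that text (for example, because the judge is shown the full prompt). Both the `Judge` interface and `toy_judge` are hypothetical stand-ins, not a real model API.

```python
from typing import Callable, Iterable

# Hypothetical interface: a judge returns True when it considers the text safe.
Judge = Callable[[str], bool]


def judge_flip_rate(judge: Judge, harmful_outputs: Iterable[str], suffix: str) -> float:
    """Fraction of outputs the judge flags on their own but passes once `suffix` is appended."""
    flagged = 0
    flipped = 0
    for output in harmful_outputs:
        if judge(output):
            continue  # already missed without the suffix, so not a suffix effect
        flagged += 1
        if judge(output + " " + suffix):
            flipped += 1
    return flipped / flagged if flagged else 0.0


def toy_judge(text: str) -> bool:
    """Stand-in for a real model call; deliberately fooled by a marker string."""
    if "<<suffix>>" in text:
        return True
    return "harmful" not in text


if __name__ == "__main__":
    samples = ["harmful sample A", "harmful sample B"]
    print(judge_flip_rate(toy_judge, samples, "<<suffix>>"))
```

In practice `toy_judge` would be replaced by a call to the actual judge model, and the suffixes would be the ones found against the generator. A flip rate well above zero suggests the judge shares the generator's weakness and should not be the only gate.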
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:29:12.208244+00:00— report_created — created