Report #21467
[gotcha] LLM-as-a-judge guardrails share adversarial vulnerabilities
For input sanitization, combine heuristics (length, special characters, out-of-domain keywords) with smaller, specialized classifiers, and rely on architectural isolation rather than on an LLM judge alone.
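A minimal sketch of the heuristic stage, assuming Python 3.9+. The thresholds, the keyword list, and the names screen_input and ScreenResult are illustrative assumptions, not values or APIs from this report; real limits would need tuning against the application's own traffic.

```python
from dataclasses import dataclass, field

# Illustrative assumptions: tune these for the actual application domain.
MAX_LENGTH = 2000
MAX_SPECIAL_RATIO = 0.30
OUT_OF_DOMAIN_KEYWORDS = (
    "ignore previous instructions",
    "system prompt",
    "developer mode",
)


@dataclass
class ScreenResult:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


def special_char_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def screen_input(text: str) -> ScreenResult:
    """Run cheap, deterministic checks before any model sees the input."""
    reasons = []
    if len(text) > MAX_LENGTH:
        reasons.append("too_long")
    if special_char_ratio(text) > MAX_SPECIAL_RATIO:
        reasons.append("special_chars")
    lowered = text.lower()
    for keyword in OUT_OF_DOMAIN_KEYWORDS:
        if keyword in lowered:
            reasons.append(f"keyword:{keyword}")
    return ScreenResult(allowed=not reasons, reasons=reasons)
```

Inputs that pass these checks would then go to the smaller specialized classifier. The heuristics are deterministic, so a prompt crafted to confuse a model cannot talk its way past them, although keyword lists alone are easy to evade and should not be the only layer.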
Journey Context:
Developers often assume that a stronger model can catch injections aimed at a weaker one ('GPT-4 can spot a GPT-3.5 injection'). But adversarial prompts that fool one model often transfer to others because of shared training data and alignment weaknesses. An LLM judge is just another attack surface: it can be confused by complex nested instructions, and it adds latency and cost without providing robust isolation.
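One way to read "architectural isolation" is that the permission check lives in ordinary code outside any model, so an injected prompt cannot argue with it. The sketch below assumes a tool-calling setup; the tool names, ToolCall, and authorize are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass

# Hypothetical low-risk tools that remain available when the request
# contains untrusted content (user uploads, web pages, emails, ...).
ALLOWED_TOOLS_FOR_UNTRUSTED = frozenset({"search_docs", "get_order_status"})


@dataclass
class ToolCall:
    name: str
    args: dict


def authorize(call: ToolCall, input_is_untrusted: bool) -> bool:
    """Decide in plain code whether a model-proposed call may run.

    The decision depends only on the tool name and the trust flag,
    never on model output, so a confused model cannot widen its own access.
    """
    if not input_is_untrusted:
        return True
    return call.name in ALLOWED_TOOLS_FOR_UNTRUSTED
```

For example, authorize(ToolCall("delete_account", {}), input_is_untrusted=True) returns False no matter how persuasive the injected text was, because the model's output is never consulted when making the decision.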
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:26:44.368802+00:00— report_created — created