Report #69726
[gotcha] An LLM judge used to evaluate the safety of another LLM is easily bypassed
Treat LLM judges as a fast heuristic, not as ground truth. Combine them with deterministic output filters (regex and string matching for PII) and traditional classifiers. If you do use an LLM judge, explicitly prompt it to be paranoid and to look for encoded or indirect requests.
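A minimal sketch of layering a deterministic filter in front of an LLM judge, in Python. The pattern set, the function names (`deterministic_flags`, `is_allowed`), and the `judge` callable are all hypothetical placeholders, not part of any specific library; the two regexes are illustrative and far from complete PII coverage.

```python
import re
from typing import Callable

# Illustrative patterns only (assumption): real PII detection needs a much
# broader, well-tested rule set or a dedicated classifier.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def deterministic_flags(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


def is_allowed(text: str, judge: Callable[[str], bool]) -> bool:
    """Allow output only if the deterministic filter AND the judge both pass.

    `judge` is a hypothetical callable wrapping whatever LLM judge is in use;
    it returns True when the judge considers the text safe.
    """
    if deterministic_flags(text):
        # The deterministic check runs first and cannot be talked out of a match.
        return False
    return judge(text)
```

The design choice here is that the judge can only further restrict the output, never override a deterministic match, so a prompt that fools the judge does not disable the regex layer.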
Journey Context:
Developers often use a 'guardrail LLM' to check the output of the primary LLM. However, adversarial prompts that trick the primary LLM often trick the judge LLM too, because the two share biases and vulnerabilities. For example, if the primary LLM outputs a base64-encoded malicious payload, the judge LLM might not decode it and may rate the output as safe.
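One way to reduce the base64 blind spot is to decode candidate runs deterministically before any judge or filter sees the text. The sketch below is an assumption-laden illustration: the helper names (`decoded_views`, `texts_to_check`) and the 16-character minimum run length are arbitrary choices, and it only handles standard base64 that decodes to UTF-8, not nested or alternative encodings.

```python
import base64
import binascii
import re

# Runs of 16+ base64-alphabet characters with optional padding.
# The length threshold is an arbitrary assumption to limit false positives.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text: str) -> list[str]:
    """Return UTF-8 decodings of base64-looking runs found in the text."""
    views = []
    for match in B64_RUN.finditer(text):
        candidate = match.group(0)
        try:
            raw = base64.b64decode(candidate, validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text once decoded: skip this run.
            continue
    return views


def texts_to_check(text: str) -> list[str]:
    """Original text plus every decoded view, all of which should be screened."""
    return [text] + decoded_views(text)
```

Each string returned by `texts_to_check` would then go through the same filters and judge as the original output, and the output is rejected if any view fails.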
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:31:06.699551+00:00— report_created — created