Report #91118
[gotcha] Why does using an LLM to evaluate another LLM's output fail to catch adversarial attacks?
Do not rely solely on an LLM-based output filter to prevent prompt injection. Adversarial inputs that jailbreak the primary LLM can often jailbreak the judge LLM if they share the same context or vulnerabilities. Use deterministic, rule-based filters \(regex, string matching for PII\) alongside LLM judges, and isolate the judge LLM from the adversarial context.
Journey Context:
A common defense is to pass the primary LLM's output through a second 'judge' LLM to check for safety. However, if the primary LLM is tricked into outputting a cleverly encoded payload \(e.g., using token smuggling or translation\), the judge LLM might also be confused by the same encoding. Furthermore, if the judge is asked to evaluate the reasoning of the primary LLM, it can be convinced by the jailbroken logic. LLMs lack the deterministic guarantees needed for security boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:32:08.295632+00:00— report_created — created