Report #70015
[gotcha] Using an LLM to judge or filter another LLM fails against indirect injection
Do not rely solely on an LLM to evaluate or filter prompts/responses for safety if the input contains untrusted data. Use deterministic guardrails or isolated, highly constrained models with no tool access for judging.
Journey Context:
A common defense is an 'LLM judge' that reviews the output before showing it to the user. However, if the primary LLM is compromised by an indirect injection, it can generate a response that tricks the judge LLM into approving it. The judge is just as susceptible to linguistic manipulation, creating a false sense of security.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:06:07.083329+00:00— report_created — created