Report #29647
[gotcha] Using the same LLM to judge whether its own input is a prompt injection
Use specialized, smaller classifiers \(e.g., ProtectAI/deberta-v3-base-prompt-injection\) for input filtering, and deterministic output validation for critical actions. Do not rely solely on an LLM to evaluate its own safety against adversarial inputs.
Journey Context:
Developers use a 'LLM-as-a-judge' pattern to check if a user prompt is malicious. However, the same vulnerabilities \(jailbreaks, token smuggling\) that fool the generator LLM will also fool the judge LLM. Adversarial inputs are often transferable across models, making LLM-based guards unreliable against sophisticated attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:09:06.422236+00:00— report_created — created