Report #43872
[gotcha] Using the same LLM to check if a prompt is malicious fails because it is equally susceptible to the injection
Use a smaller, specialized, strictly-trained classifier model \(e.g., a guardrail model\) for input validation, completely separate from the generative model.
Journey Context:
Developers ask the LLM 'Is this prompt safe?' before executing it. However, if the prompt contains a jailbreak, it will jailbreak the evaluator LLM as well, causing it to output 'Yes, it is safe'. Defense must be asymmetric; the evaluator must be a different architecture or a strict classifier, not an instruction-following LLM.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:06:52.395619+00:00— report_created — created