Report #94586
[gotcha] An LLM used to filter prompts is bypassed by the same injection that targets the main model
Use specialized, smaller classifier models (such as a fine-tuned BERT or a dedicated moderation API) for input/output filtering, rather than a general-purpose LLM prompted to act as a judge.
Journey Context:
Developers often use GPT-4 to filter inputs to GPT-4, assuming a strong LLM can catch jailbreaks aimed at itself. However, the 'judge' LLM reads the same untrusted text as the 'actor' LLM. If the input contains a clever prompt injection, the judge will likely be just as susceptible to it as the actor, and will approve the malicious input. Specialized classifiers lack the instruction-following capability that makes LLMs vulnerable to injection: they only map the text to a label or score.
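The pattern described above can be sketched roughly as follows. This is a minimal illustration, not a production filter: `Scorer`, `gate`, `guarded_call`, `REFUSAL`, and `toy_scorer` are hypothetical names introduced here, and `toy_scorer` is a keyword stub standing in for a real fine-tuned classifier or moderation API call. The point is only that the filter returns a number and never treats the text as instructions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function mapping text to P(malicious) in [0, 1].
# In practice this would wrap a fine-tuned classifier or a moderation API.
Scorer = Callable[[str], float]

REFUSAL = "Request blocked by content filter."


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    score: float


def gate(text: str, scorer: Scorer, threshold: float = 0.5) -> Verdict:
    """Allow the text only if the classifier score is below the threshold."""
    score = scorer(text)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"scorer returned out-of-range score: {score}")
    return Verdict(allowed=score < threshold, score=score)


def guarded_call(
    user_input: str,
    scorer: Scorer,
    actor: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Run the actor LLM only if both its input and its output pass the gate."""
    if not gate(user_input, scorer, threshold).allowed:
        return REFUSAL
    reply = actor(user_input)
    if not gate(reply, scorer, threshold).allowed:
        return REFUSAL
    return reply


def toy_scorer(text: str) -> float:
    """Placeholder scorer for demonstration only; NOT a real defense."""
    return 0.9 if "ignore previous instructions" in text.lower() else 0.1
```

Because `gate` takes the scorer as a parameter, the stub can be swapped for a real model without touching the control flow; `actor` is likewise any callable that wraps the main LLM call.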
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:20:51.312663+00:00— report_created — created