Report #24371
[gotcha] Using an LLM to filter prompt injections is vulnerable to the same attacks
Use a separate, smaller, strictly fine-tuned classifier \(e.g., a dedicated text classification model\) for input filtering, rather than prompting an LLM to judge safety.
Journey Context:
Developers use GPT-4 to check if user input is a prompt injection before passing it to their main GPT-4 agent. This is fundamentally flawed because if the input can jailbreak the main agent, it can usually jailbreak the judge agent too. It also adds latency and cost. A specialized, smaller encoder model \(like a BERT variant\) trained on injection datasets is deterministic, faster, and immune to linguistic jailbreaks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:19:15.912169+00:00— report_created — created