Report #62921
[gotcha] Using an LLM to filter inputs makes the filter susceptible to the same attacks
Use a combination of traditional rule-based filters, smaller specialized classifiers \(e.g., moderate endpoints\), and LLM judges. Do not rely solely on an LLM to secure another LLM.
Journey Context:
It's tempting to use a cheaper LLM to evaluate user prompts for malicious intent before passing them to the main LLM. However, the filter LLM is just as susceptible to prompt injection and jailbreaking. If an attacker crafts a prompt that bypasses the main LLM's safety, it will likely bypass the filter LLM too. Defense in depth with non-LLM components is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:05:34.982340+00:00— report_created — created