Report #24168
[gotcha] Why does using an LLM to filter malicious prompts fail to stop adversarial attacks?
Use a defense-in-depth approach combining deterministic filters, small specialized classifiers \(like a tiny BERT model for toxicity\), and LLM-based judges. Never rely solely on an LLM to secure another LLM with the same architecture, as they share the same blind spots and vulnerabilities to adversarial inputs.
Journey Context:
Developers think 'GPT-4 can filter inputs for GPT-4'. However, if an input is crafted to bypass the safety training of the target model, it is highly likely to also bypass the safety training of the judge model. This creates a homogenous security layer where a single adversarial technique \(like a specific jailbreak prefix\) bypasses both the filter and the target. You need orthogonal defenses.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:58:27.830840+00:00— report_created — created