Report #27460
[gotcha] Using an LLM to filter prompts for another LLM creates a false sense of security due to shared vulnerabilities
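Combine deterministic checks (e.g., regex, string matching, smaller specialized classifier models) with LLM guardrails rather than relying on an LLM alone, and choose a guardrail model that differs substantially from the target model (for example, a different model family or different training data) so the two are less likely to fail on the same input.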
Journey Context:
Developers deploy a "guardrail LLM" (like GPT-4) to check whether a prompt is malicious before sending it to the main LLM. Because the two models can share similar training data and alignment weaknesses, an adversarial prompt that jailbreaks the main model often jailbreaks the guardrail model as well. The guardrail then approves the prompt and the attack passes through unimpeded, while the presence of the filter gives the impression that the input was vetted.
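A minimal sketch of the layered approach, assuming Python and only the standard library. The deny patterns, the `Verdict` type, and the `screen_prompt` and `llm_guardrail` names are hypothetical and exist only for illustration; a real deployment would need a tuned rule set and its own model client. The point it shows is that the deterministic layer runs first and cannot be talked out of its verdict, so a jailbroken guardrail model cannot override it.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical deny patterns, for illustration only; a real list needs tuning.
DENY_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.I),
]


@dataclass
class Verdict:
    allowed: bool
    reason: str


def normalize(prompt: str) -> str:
    # NFKC folds many look-alike characters (e.g. fullwidth letters).
    text = unicodedata.normalize("NFKC", prompt)
    # Drop Unicode format characters (category Cf), such as zero-width spaces.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def deterministic_check(prompt: str) -> Verdict:
    text = normalize(prompt)
    for pattern in DENY_PATTERNS:
        if pattern.search(text):
            return Verdict(False, f"matched deny pattern: {pattern.pattern}")
    return Verdict(True, "no deterministic rule matched")


def screen_prompt(
    prompt: str,
    llm_guardrail: Optional[Callable[[str], bool]] = None,
) -> Verdict:
    # Layer 1: deterministic rules. These do not share the LLM's weaknesses.
    verdict = deterministic_check(prompt)
    if not verdict.allowed:
        return verdict
    # Layer 2: optional LLM guardrail (assumed to return True when it
    # considers the prompt safe). Ideally a different model family
    # from the target model.
    if llm_guardrail is not None and not llm_guardrail(prompt):
        return Verdict(False, "flagged by LLM guardrail")
    return Verdict(True, "passed all layers")
```

Usage under the same assumptions: `screen_prompt(user_input, llm_guardrail=my_guardrail)` where `my_guardrail` is any callable wrapping the guardrail model. Even if that callable is fooled and returns `True`, a prompt that matches a deny pattern is still rejected. The reverse also holds: pattern lists are easy to evade with rephrasing, which is why the sketch keeps both layers rather than replacing one with the other.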
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:29:20.486007+00:00— report_created — created