Report #86775
[gotcha] Using a stronger LLM as a filter makes my app perfectly safe from jailbreaks
Use a combination of traditional security measures (regex, string matching, RBAC) alongside LLM filters. Do not rely solely on an LLM to secure another LLM.
Journey Context:
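Developers often use a stronger model such as GPT-4 to filter inputs for an app built on a weaker model such as GPT-3.5. However, the same adversarial tokens or multi-turn strategies that jailbreak the target LLM can often jailbreak the filter LLM as well. If the LLM filter is the only safeguard and it fails, the app is completely exposed. Defense in depth is required: deterministic filters should run alongside the LLM filter so that a single bypass does not expose the app.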
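A minimal sketch of the layered approach, assuming a Python backend. Every name here (DENY_PATTERNS, ROLE_PERMISSIONS, is_allowed, the llm_filter callable) is hypothetical and chosen for illustration; the patterns are examples, not a vetted rule set. The deterministic layers run first, so a request they reject never depends on the LLM filter's verdict.

```python
import re
from typing import Callable

# Hypothetical denylist; illustrative patterns only, not a complete rule set.
DENY_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

# Hypothetical role-to-action map for the RBAC layer.
ROLE_PERMISSIONS = {
    "viewer": {"ask"},
    "admin": {"ask", "export_data"},
}


def deterministic_check(text: str) -> bool:
    """Return True when no denylist pattern matches the input."""
    return not any(pattern.search(text) for pattern in DENY_PATTERNS)


def rbac_check(role: str, action: str) -> bool:
    """Return True when the role is permitted to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def is_allowed(
    text: str,
    role: str,
    action: str,
    llm_filter: Callable[[str], bool],
) -> bool:
    """Apply all layers in order; any single rejection blocks the request."""
    # Layer 1: RBAC limits what a successful jailbreak could reach.
    if not rbac_check(role, action):
        return False
    # Layer 2: deterministic matching, independent of any model's judgment.
    if not deterministic_check(text):
        return False
    # Layer 3: the LLM filter (assumed to return True for "looks safe").
    return llm_filter(text)
```

For example, passing a stand-in filter such as `lambda text: True` simulates a filter LLM that has itself been jailbroken: a prompt containing "ignore previous instructions" is still rejected by the deterministic layer, and a "viewer" requesting "export_data" is still rejected by RBAC. Pattern matching can be evaded by rephrasing, so this complements the LLM filter rather than replacing it.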
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:14:25.793996+00:00— report_created — created