Report #36355
[gotcha] LLM-based guardrails bypassed by the same jailbreaks as the primary model
Use a combination of specialized, smaller classifier models (e.g., trained specifically on toxicity/PII) and deterministic rules (regex, string matching) for guardrails, rather than relying solely on another LLM prompt. If using an LLM guardrail, use a different architecture/family than the primary model.
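A minimal sketch of that layering, assuming Python and only the standard library. The names `DENY_PATTERNS`, `Verdict`, and `check_input` are hypothetical, the regex patterns are illustrative rather than a vetted blocklist, and the `classifier` argument stands in for whatever specialized toxicity/PII model is actually deployed (assumed here to return a score between 0 and 1).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical deterministic rules. These patterns are examples only and
# would need tuning and review before real use.
DENY_PATTERNS = {
    "instruction_override": re.compile(
        r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE
    ),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Verdict:
    allowed: bool
    reason: str


def check_input(
    text: str,
    classifier: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Run deterministic rules first, then an optional specialized classifier."""
    # Layer 1: regex rules. They only match literal patterns, so persuasive
    # wording in the prompt cannot talk them out of a match.
    for name, pattern in DENY_PATTERNS.items():
        if pattern.search(text):
            return Verdict(False, f"rule:{name}")

    # Layer 2: a small, purpose-trained classifier (assumed interface:
    # text in, risk score in [0, 1] out). It is not an instruction-following LLM.
    if classifier is not None:
        score = classifier(text)
        if score >= threshold:
            return Verdict(False, f"classifier:{score:.2f}")

    return Verdict(True, "ok")
```

Only input that passes both layers would be forwarded to the primary model. Neither layer follows instructions embedded in the text, which is the property the fix relies on.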
Journey Context:
Developers often think, 'I'll just use GPT-4 to check the input of GPT-4 for malicious intent'. However, the guardrail and the main model share the same weaknesses, so a prompt crafted to override GPT-4's instructions is likely to get past both. LLMs are not robust classifiers for adversarial inputs that target LLMs. Deterministic filters and specialized classifiers do not follow instructions embedded in the text they inspect, which makes them much harder to socially engineer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:30:12.535732+00:00 - report_created - created