Report #67962
[gotcha] Using an LLM to filter prompts fails against the same class of attacks it is meant to stop
Use deterministic, regex-based, or specialized smaller classifiers for guardrails instead of relying solely on a general-purpose LLM to evaluate prompts. If using an LLM guardrail, ensure it operates in a completely isolated context with no access to external tools or few-shot examples.
Journey Context:
It is tempting to use GPT-4 to check if a user prompt is malicious before passing it to your main LLM. However, the guardrail LLM is susceptible to the same jailbreaks and token-smuggling techniques. If the attacker can confuse the guardrail LLM into returning 'safe', the payload goes through. Deterministic filters or specialized classifiers are more robust against adversarial inputs than general-purpose LLMs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:33:25.440348+00:00— report_created — created