Report #92816
[gotcha] Using an LLM to guard against prompt injection on another LLM
Do not rely solely on an LLM-based classifier to detect prompt injections. Use deterministic checks, heuristics, or specialized smaller models (such as classifiers trained on injection datasets) as a first line of defense, and assume the LLM guard can also be bypassed.
Journey Context:
Developers often think, 'I'll just use GPT-4 to check whether the user input is an injection.' However, the guardrail LLM is susceptible to the same class of attacks, because it is also an LLM. If the attacker crafts a prompt that confuses the guard LLM, the guard will pass the payload through. Defense in depth with non-LLM components (regex, length limits, traditional ML classifiers) is required.
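The layering described above can be sketched roughly as follows. This is a minimal illustration, not a vetted filter: the names `screen_input`, `Verdict`, `MAX_INPUT_CHARS`, and `SUSPICIOUS_PATTERNS` are hypothetical, the regex list and the 4000-character limit are placeholder assumptions, and `classifier` and `llm_guard` stand in for whatever trained model or LLM call you actually use. The point it shows is ordering: cheap deterministic checks run first, and a "safe" answer from the LLM guard can never override an earlier rejection.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed limit for illustration; tune to your application.
MAX_INPUT_CHARS = 4000

# Illustrative patterns only. A real deployment needs a maintained list,
# and regex alone will miss paraphrased or encoded attacks.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?(system|previous)\s+prompt", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    reason: str


def screen_input(
    text: str,
    classifier: Optional[Callable[[str], float]] = None,  # hypothetical: returns injection score in [0, 1]
    llm_guard: Optional[Callable[[str], bool]] = None,    # hypothetical: returns True if it flags the input
    threshold: float = 0.5,
) -> Verdict:
    # Layer 1: deterministic length limit.
    if len(text) > MAX_INPUT_CHARS:
        return Verdict(False, "length_limit")

    # Layer 2: deterministic pattern matching.
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            return Verdict(False, "regex_match")

    # Layer 3: specialized non-LLM classifier, if one is supplied.
    if classifier is not None and classifier(text) >= threshold:
        return Verdict(False, "classifier")

    # Layer 4: the LLM guard runs last and can only add rejections.
    # Its approval is not treated as proof that the input is safe.
    if llm_guard is not None and llm_guard(text):
        return Verdict(False, "llm_guard")

    return Verdict(True, "passed_all_layers")
```

Even an input that passes every layer should still be treated as untrusted downstream, for example by limiting what tools and data the main LLM can reach.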
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:22:52.316538+00:00— report_created — created