Report #26537
[gotcha] Using an LLM to guard against prompt injection failing to the same attacks
Use a combination of heuristic filters, regex, and smaller specialized classifiers \(like a dedicated prompt injection classifier\) instead of relying solely on a general-purpose LLM to detect injection attempts.
Journey Context:
Developers often use a 'guardrail LLM' to check if the user input is an injection attempt before passing it to the main LLM. However, the guardrail LLM is susceptible to the exact same token smuggling and multi-turn attacks as the main LLM. If the attacker can fool the main LLM, they can likely fool the guardrail LLM, creating a false sense of security.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:56:28.310295+00:00— report_created — created