Report #53403
[gotcha] Using LLMs to guard against LLM prompt injection
Do not rely solely on an LLM-based guardrail to detect prompt injection. Use deterministic, regex-based, and heuristic filters as a first line of defense, and treat LLM-based classifiers as probabilistic supplements, not ground truth.
Journey Context:
It's tempting to use a "guardrail LLM" to check whether an input is a prompt injection. However, the guardrail is itself an LLM, so it is susceptible to the same attacks (such as token smuggling or many-shot prompting) as the target LLM. A payload crafted to bypass the guardrail LLM is likely to work against the target as well, so passing the guardrail creates a false sense of security. Defense in depth requires deterministic boundaries (like length limits, character whitelisting, and strict output schemas) that cannot be socially engineered.
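The deterministic boundaries named above (length limit, character whitelisting, strict output schema) could look roughly like the following minimal Python sketch. Everything in it is an illustrative assumption rather than a vetted rule set: the names `check_input` and `parse_output`, the 2000-character limit, the allowed character set, the two regex patterns, and the `action`/`text` output schema are all hypothetical. The regex heuristics in particular are easy to evade on their own; they are a cheap first layer, with any LLM-based classifier running after them as a probabilistic supplement.

```python
import json
import re
import unicodedata

# All limits, patterns, and schema fields below are hypothetical examples.
MAX_INPUT_CHARS = 2000
ALLOWED_CHARS = re.compile(r"""[A-Za-z0-9 .,;:!?'"()\n-]*""")
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"\bsystem\s+prompt\b", re.IGNORECASE),
]
ALLOWED_ACTIONS = {"answer", "refuse"}


def check_input(text: str) -> list[str]:
    """Return the reasons an input is rejected; an empty list means it passed."""
    reasons = []
    # Normalize first so compatibility forms (e.g. fullwidth letters) are
    # checked in their canonical shape.
    normalized = unicodedata.normalize("NFKC", text)
    if len(normalized) > MAX_INPUT_CHARS:
        reasons.append("too_long")
    if not ALLOWED_CHARS.fullmatch(normalized):
        reasons.append("disallowed_characters")
    if any(p.search(normalized) for p in SUSPICIOUS_PATTERNS):
        reasons.append("suspicious_phrase")
    return reasons


def parse_output(raw: str) -> dict:
    """Accept model output only if it matches a strict, fixed JSON schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or set(data) != {"action", "text"}:
        raise ValueError("output must be an object with exactly 'action' and 'text'")
    if not isinstance(data["action"], str) or data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    if not isinstance(data["text"], str):
        raise ValueError("'text' must be a string")
    return data
```

Because these checks are plain code, a persuasive payload cannot argue its way past them: an input either fits the limits or it does not, and the downstream system only acts on output that parses into the fixed schema. The trade-off is rigidity, so the allowed character set and schema would need tuning to the actual application.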
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:07:55.899999+00:00— report_created — created