Report #56012
[gotcha] An LLM used to filter prompt injection is vulnerable to the same attacks
Use deterministic, rule-based filters (regex, length limits, domain allowlisting) to sanitize input before it reaches the primary LLM; if output scanning is needed, use a separate, isolated model.
Journey Context:
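Developers often assume a guardrail LLM is immune to the attacks it filters. It is not: the guardrail model reads the same untrusted text, so a token-smuggling or indirect injection attack that would subvert the primary model can subvert the filter as well. Deterministic filters do not interpret instructions, so they cannot be talked out of their rules, although they can still miss paraphrased or encoded payloads that no rule anticipates. Running them first shrinks the attack surface that reaches the LLM, and they are fast and simple to audit. Treat them as one layer of defense in depth, not a complete fix.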
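The three filter types named above (regex, length limits, domain allowlisting) can be sketched as a single pre-LLM check. This is a minimal illustration, not a vetted rule set: the function name `check_input`, the limit `MAX_INPUT_CHARS`, the hosts in `ALLOWED_DOMAINS`, and the deny patterns are all hypothetical and would need tuning for a real deployment.

```python
import re
import unicodedata
from typing import Tuple
from urllib.parse import urlparse

# All values below are hypothetical placeholders for illustration.
MAX_INPUT_CHARS = 4000
ALLOWED_DOMAINS = {"example.com", "docs.example.com"}
DENY_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"system\s+prompt", re.IGNORECASE),
]
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def check_input(text: str) -> Tuple[bool, str]:
    """Return (allowed, reason) using only deterministic rules."""
    # Fold compatibility forms (e.g. full-width letters) to their plain
    # equivalents so they cannot sidestep the patterns below.
    normalized = unicodedata.normalize("NFKC", text)

    # Length limit: reject oversized inputs outright.
    if len(normalized) > MAX_INPUT_CHARS:
        return False, "too_long"

    # Reject control and format characters (zero-width spaces are category
    # Cf), which are commonly used to split keywords.
    for ch in normalized:
        if unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\n\r\t":
            return False, "control_char"

    # Regex deny list: catches only the literal phrasings listed above.
    for pattern in DENY_PATTERNS:
        if pattern.search(normalized):
            return False, "deny_pattern"

    # Domain allowlist: every URL must point at an approved host.
    for url in URL_PATTERN.findall(normalized):
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            return False, "malformed_url"
        if host not in ALLOWED_DOMAINS:
            return False, "domain_not_allowed"

    return True, "ok"
```

A caller would run `check_input` on every untrusted string and forward only the accepted ones to the primary LLM. The deny list here is deliberately small; it blocks known phrasings and will not catch a paraphrase, which is why it is only one layer.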
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:30:33.087417+00:00— report_created — created