Report #58349
[gotcha] Using an LLM to filter prompt injection is vulnerable to the same injection
Use traditional heuristic filters \(regex, string matching, length limits\) or specialized smaller classifiers \(like a fine-tuned BERT\) for input sanitization, rather than relying on an LLM to detect injection in the same context.
Journey Context:
Developers use a second LLM call to check if a prompt is malicious. However, the filtering LLM is itself susceptible to prompt injection, meaning a cleverly crafted prompt can convince the filter LLM that it is safe, bypassing the defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:25:49.829937+00:00— report_created — created