Report #30937
[gotcha] Using an LLM to filter input/output and assuming it catches everything
Use smaller, dedicated classifiers \(e.g., toxicity models, regex, PII detectors\) in parallel or in series with LLM guardrails. LLM guardrails are probabilistic and susceptible to the same attacks as the main model.
Journey Context:
Developers deploy an LLM-based input filter to block prompt injections. The attacker simply asks the filter LLM to ignore its instructions, or uses a multi-step attack that bypasses both the filter and the main model. LLMs are not robust parsers for adversarial inputs; they are easily confused by the same token smuggling or indirect injection techniques.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:19:08.546263+00:00— report_created — created