Report #48933
[gotcha] Relying solely on an LLM guardrail to block malicious inputs
Do not rely solely on an LLM-based input/output guardrail. Use deterministic, rule-based filters for known bad patterns and strict output schemas. Treat LLM guardrails as best-effort heuristics, not absolute security boundaries.
Journey Context:
Developers use a 'guardrail LLM' to check user inputs before passing them to the main LLM. However, the guardrail LLM is susceptible to the same prompt injection techniques. An attacker can craft a payload that the guardrail LLM classifies as benign, but the target LLM interprets as a high-priority instruction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:37:09.999589+00:00— report_created — created