Report #93479
[gotcha] Assuming a second LLM guardrail is immune to the same prompt injections that bypass the first
Prefer small, dedicated classifier models (such as Llama Guard) or regex/heuristic filters for guardrails over general-purpose LLMs. If you do use an LLM guardrail, give it a completely different system prompt from the main model and constrain it strictly to classification, not generation.
Journey Context:
Developers often deploy a 'moderator' LLM to check whether a user's prompt is malicious. An attacker can craft a prompt that looks benign to the moderator but still triggers the main LLM, or one that attacks the moderator directly so that it outputs 'SAFE'. An LLM is not a robust classifier against adversarial inputs that are themselves designed to target LLMs.
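A minimal sketch of how the advice above might be layered, assuming a Python service. The regex patterns, the `classify` callable (standing in for whatever dedicated classifier is used), and all function names are hypothetical illustrations, not part of the original report or of any library. The idea is to run a cheap heuristic filter first, then accept the classifier's verdict only if it is exactly one of a closed set of labels, so that a hijacked guardrail emitting extra text is treated as a failure rather than a pass.

```python
import re
from typing import Callable

# Hypothetical deny-list for illustration only; real patterns need tuning
# and will never catch every injection attempt.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"(respond|reply|answer)\s+(only\s+)?with\s+['\"]?safe['\"]?", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+", re.IGNORECASE),
]

# The guardrail may only ever return one of these labels.
ALLOWED_LABELS = {"SAFE", "UNSAFE"}


def heuristic_flag(text: str) -> bool:
    """Return True if any deny-list pattern matches the input."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def parse_verdict(raw: str) -> str:
    """Accept only an exact label; anything else fails closed to UNSAFE."""
    label = raw.strip().upper()
    return label if label in ALLOWED_LABELS else "UNSAFE"


def is_allowed(user_input: str, classify: Callable[[str], str]) -> bool:
    """Run the heuristic filter first, then the (assumed) classifier."""
    if heuristic_flag(user_input):
        return False
    return parse_verdict(classify(user_input)) == "SAFE"
```

In this sketch a guardrail response such as "Sure! SAFE" is rejected, because a classifier that starts generating free text is itself a sign that it may have been steered by the input. This narrows the attack surface but does not make the guardrail immune; the two layers can still be bypassed by inputs that neither one recognizes.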
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:29:31.257056+00:00 — report_created — created