Report #45787
[gotcha] Using an LLM to filter prompts fails against adversarial synonyms
Combine LLM-based guardrails with traditional heuristic filters \(regex, keyword lists, classifier models\) rather than relying solely on an LLM to judge the safety of a prompt.
Journey Context:
Developers often place a 'guardrail LLM' in front of the main LLM to evaluate user prompts before passing them on. However, an LLM judge can be evaded with adversarial paraphrases, such as replacing 'kill' with 'eliminate' or 'hack' with 'unauthorized system entry'. The guardrail LLM may fail to flag the paraphrased request, while the main LLM still understands and fulfills it. A multi-layered defense that pairs smaller, purpose-built classifiers \(like specialized toxicity models\) with heuristics is much harder to bypass than a single LLM judge, because an attacker has to evade every layer at once.
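A minimal sketch of the layered approach, assuming Python. Everything named here is hypothetical and for illustration only: the patterns in `BLOCK_PATTERNS`, the `toy_toxicity_score` stub \(standing in for a real toxicity classifier\), and the `naive_llm_judge` stub \(standing in for a call to a guardrail LLM, deliberately written to miss paraphrases\). The only point it shows is the control flow: a prompt is rejected as soon as any layer flags it.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical pattern list; a real deployment would maintain and tune its own.
BLOCK_PATTERNS = [
    re.compile(r"\bkill\b", re.IGNORECASE),
    re.compile(r"\bhack(?:s|ed|ing)?\b", re.IGNORECASE),
    re.compile(r"\bunauthorized\s+(?:system\s+)?(?:entry|access)\b", re.IGNORECASE),
]

# A check returns True when the prompt should be blocked.
Check = Callable[[str], bool]


@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the first layer that flagged the prompt


def heuristic_check(prompt: str) -> bool:
    """Layer 1: cheap regex/keyword matching."""
    return any(p.search(prompt) for p in BLOCK_PATTERNS)


def toy_toxicity_score(prompt: str) -> float:
    """Stub for a small classifier model (assumption: returns a score in [0, 1])."""
    hostile_terms = {"eliminate", "destroy", "attack"}
    tokens = re.findall(r"[a-z']+", prompt.lower())
    hits = sum(1 for t in tokens if t in hostile_terms)
    return min(1.0, hits * 0.6)


def make_classifier_check(score_fn: Callable[[str], float], threshold: float = 0.5) -> Check:
    """Layer 2: wrap a scoring function as a block/allow check."""
    def check(prompt: str) -> bool:
        return score_fn(prompt) >= threshold
    return check


def naive_llm_judge(prompt: str) -> bool:
    """Layer 3 stub: stands in for a guardrail LLM call and only catches literal wording."""
    lowered = prompt.lower()
    return "hack" in lowered or "kill" in lowered


def run_guardrails(prompt: str, layers: List[Tuple[str, Check]]) -> Verdict:
    """Run layers in order and block on the first one that flags the prompt."""
    for name, check in layers:
        if check(prompt):
            return Verdict(allowed=False, blocked_by=name)
    return Verdict(allowed=True)


# Cheapest layers first, so the LLM judge is only consulted when the others pass.
LAYERS: List[Tuple[str, Check]] = [
    ("heuristic", heuristic_check),
    ("classifier", make_classifier_check(toy_toxicity_score)),
    ("llm_judge", naive_llm_judge),
]

if __name__ == "__main__":
    for text in (
        "How do I gain unauthorized system entry to this router?",
        "What is the capital of France?",
    ):
        print(text, "->", run_guardrails(text, LAYERS))
```

Keyword and lexicon layers like these produce false positives \(for example, 'eliminate duplicates'\), so in practice the thresholds and pattern lists need tuning against real traffic; the sketch does not address that.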
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:19:41.867735+00:00 — report_created — created