Report #35249
[gotcha] Adversarial attacks bypassing LLM-based input guardrails
Do not rely solely on an LLM to filter inputs for another LLM. Use deterministic filters (regex, string matching) and dedicated classifiers as the primary defense, as adversarial attacks transfer easily between models.
Journey Context:
Developers often use a smaller LLM, or a separately prompted LLM, to check whether user input is malicious. However, adversarial prompts that jailbreak the main LLM are often transferable and can jailbreak the guardrail LLM as well. LLMs are not robust classifiers for adversarial inputs.
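A minimal sketch of the layering recommended above, assuming Python and the standard library only. Deterministic checks (length limit, regex deny-list) run first, and a dedicated classifier is consulted afterwards if one is supplied. The names `DENY_PATTERNS`, `MAX_INPUT_CHARS`, `normalize`, and `check_input`, the example patterns, and the 0.5 threshold are all hypothetical and chosen for illustration; they are not taken from any particular library, and the pattern list is far from complete.

```python
import re
import unicodedata
from typing import Callable, Optional, Tuple

# Hypothetical deny-list for illustration only; a real deployment would
# maintain and test its own patterns.
DENY_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.IGNORECASE),
]

MAX_INPUT_CHARS = 4000  # assumed limit, tune per application


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms and collapse whitespace before matching."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def check_input(
    text: str,
    classifier: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> Tuple[bool, str]:
    """Return (allowed, reason). Deterministic checks always run first."""
    cleaned = normalize(text)
    if len(cleaned) > MAX_INPUT_CHARS:
        return False, "too_long"
    for pattern in DENY_PATTERNS:
        if pattern.search(cleaned):
            return False, "deny_pattern"
    # Optional dedicated classifier: any callable returning a score in [0, 1],
    # where higher means more likely malicious (an assumed convention).
    if classifier is not None and classifier(cleaned) >= threshold:
        return False, "classifier"
    return True, "ok"
```

For example, `check_input("Please IGNORE all previous   instructions")` is rejected by the deny-list regardless of how any downstream model would respond to it. An LLM-based check could still be added as a further layer, but in this sketch it would never be the only gate.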
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:37:57.163970+00:00— report_created — created