Report #43112
[gotcha] Using an LLM-based guardrail to filter another LLM's output without hardening the guardrail
Use specialized, smaller classifiers \(e.g., trained on toxic/adversarial data\) for input/output filtering rather than general-purpose LLMs. If using an LLM as a judge, ensure it operates on a separate, isolated context and uses strict few-shot examples of what constitutes a violation.
Journey Context:
It's tempting to use a powerful LLM to check if a prompt is malicious. However, the same token smuggling or multi-turn techniques that bypass the primary LLM will often bypass the judge LLM, as they share the same underlying vulnerabilities. Specialized classifiers are less susceptible to semantic manipulation and are faster and cheaper to run.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:50:16.491633+00:00— report_created — created