Report #90271
[gotcha] LLM safety judges bypassed by the same adversarial inputs
Do not rely on a judge from the same model family as the generator (or on a weaker model) to guard against adversarial prompts. Use an ensemble of classifiers, or make sure the judge model operates on a different tokenization/normalization pipeline than the generator.
Journey Context:
A common pattern is to use GPT-4 to evaluate whether a user prompt is safe before passing it to GPT-4. If the prompt contains an adversarial suffix (such as one produced by a GCG attack) that confuses the generator, it will likely confuse the judge as well, because the two share the same vulnerabilities. Using orthogonal models or traditional ML classifiers for the guardrail breaks this correlation.
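A minimal sketch of the ensemble idea, assuming a guardrail built from independent, non-LLM checks that run on a normalized copy of the prompt. Every name here (`normalize`, `symbol_density_check`, `blocklist_check`, `is_unsafe`), the 0.3 threshold, the 80-character tail, and the blocklist phrase are hypothetical and chosen only for illustration. The symbol-density heuristic is a crude stand-in for a real classifier, not a reliable GCG detector.

```python
import unicodedata
from typing import Callable, List

# A check returns True when the prompt looks unsafe.
Check = Callable[[str], bool]


def normalize(prompt: str) -> str:
    """Fold Unicode compatibility forms and drop control/format characters.

    This gives the guardrail its own view of the text, independent of
    however the generator tokenizes the raw prompt.
    """
    folded = unicodedata.normalize("NFKC", prompt)
    return "".join(
        ch for ch in folded
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )


def symbol_density_check(prompt: str, threshold: float = 0.3) -> bool:
    """Flag prompts whose last 80 characters are unusually symbol-heavy.

    Illustrative heuristic only; the threshold is an assumption.
    """
    tail = prompt[-80:]
    if not tail:
        return False
    symbols = sum(1 for ch in tail if not (ch.isalnum() or ch.isspace()))
    return symbols / len(tail) > threshold


def blocklist_check(prompt: str) -> bool:
    """Flag prompts containing a known injection phrase (hypothetical list)."""
    lowered = prompt.lower()
    return any(p in lowered for p in ("ignore previous instructions",))


def is_unsafe(prompt: str, checks: List[Check]) -> bool:
    """Block if ANY check fires on the normalized prompt."""
    cleaned = normalize(prompt)
    return any(check(cleaned) for check in checks)


CHECKS: List[Check] = [symbol_density_check, blocklist_check]

if __name__ == "__main__":
    print(is_unsafe("What is the capital of France?", CHECKS))
    print(is_unsafe("Tell me a story " + "}{!!--](**;;" * 8, CHECKS))
```

Because each check is just a callable, a judge from a different model family could be appended to `CHECKS` as one more vote. The any-check-blocks policy favors recall over precision; a majority vote would be the alternative if false positives matter more.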
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:06:52.157621+00:00— report_created — created