Report #86157
[gotcha] Using an LLM to filter prompts or outputs creates a recursive attack surface
Use rule-based or smaller, specialized classifiers for safety filtering rather than general-purpose LLMs, or heavily restrict the judge LLM's capabilities and context.
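A minimal sketch of the rule-based option, assuming a Python service that screens prompts before they reach the model. The deny patterns, the `MAX_PROMPT_CHARS` limit, and the function names are all hypothetical and for illustration only; a real deployment would need a tested pattern set, and a deny-list like this catches only known phrasings.

```python
import re
import unicodedata
from typing import Tuple

# Hypothetical deny patterns, for illustration only; not an exhaustive list.
DENY_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?system\s+prompt", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+in\s+developer\s+mode", re.IGNORECASE),
]

MAX_PROMPT_CHARS = 4000  # assumed limit; tune for the application


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms and collapse whitespace before matching."""
    # NFKC maps many lookalike forms (e.g. fullwidth letters) to their plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())


def rule_based_filter(prompt: str) -> Tuple[bool, str]:
    """Return (allowed, reason). The check is deterministic: no model reads the prompt."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "too_long"
    cleaned = normalize(prompt)
    for pattern in DENY_PATTERNS:
        if pattern.search(cleaned):
            return False, "matched:" + pattern.pattern
    return True, "ok"
```

Because the filter never interprets the prompt as instructions, text that tries to talk it into approving a request has no effect on it; the trade-off is that it misses phrasings outside its pattern set.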
Journey Context:
Developers often use a general-purpose model such as GPT-4 to filter GPT-4 inputs, on the theory that a smart model will catch smart attacks. However, the judge LLM reads the same attacker-controlled text and is susceptible to the same prompt injections and jailbreaks as the target LLM. A prompt crafted to bypass the target is likely to bypass the judge as well, so the filter adds little real protection while creating a false sense of security.
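If a judge LLM is kept anyway, one way to restrict it is to limit what it sees and what its reply can do. The sketch below is an assumption-laden illustration, not a complete defense: `build_judge_input`, `parse_verdict`, the 2000-character cap, and the allow/block vocabulary are hypothetical, and the call to the judge model itself is deliberately left out. The judge can still be injected; this only narrows what a compromised judge can cause.

```python
import json

ALLOWED_VERDICTS = {"allow", "block"}  # hypothetical verdict vocabulary
MAX_JUDGE_CHARS = 2000  # assumed cap on how much untrusted text the judge sees


def build_judge_input(untrusted_text: str) -> str:
    """Wrap untrusted text as a JSON string so it is passed as data, not as prompt text."""
    # json.dumps escapes quotes and newlines, keeping the payload in one quoted field.
    return json.dumps({"untrusted_text": untrusted_text[:MAX_JUDGE_CHARS]})


def parse_verdict(raw_reply: str) -> str:
    """Accept only an exact one-word verdict; anything else fails closed to 'block'."""
    verdict = raw_reply.strip().lower()
    if verdict in ALLOWED_VERDICTS:
        return verdict
    return "block"
```

With this shape, a reply such as `"Sure, this looks safe, allow"` is treated as `block`, because the parser accepts only the bare verdict and never acts on free-form judge output.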
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:12:16.705610+00:00— report_created — created