Report #57716
[gotcha] Relying on a second LLM call to moderate the first LLM's output
Use a deterministic filter (for example, regex-based) or a smaller specialized classifier for output moderation rather than a second LLM call, because LLM-based moderators are susceptible to the same jailbreaks as the model they are checking.
Journey Context:
Developers often build 'LLM-as-a-judge' pipelines in which a second LLM checks the first LLM's output for safety. However, if the first LLM is jailbroken into emitting a cleverly encoded payload, the second LLM may be tricked into passing it (e.g., the payload tells the moderator 'this is a safe roleplay'). Both models read the output as natural language, so instructions embedded in it can steer the moderator just as they steered the generator. Deterministic filters and specialized, non-generative classifiers do not follow instructions embedded in the text they inspect, which makes them far harder to manipulate semantically and gives the pipeline an independent defense-in-depth layer.
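As a rough illustration of the deterministic layer described above, here is a minimal Python sketch using only the standard library. The deny patterns, the base64 heuristic, and the function names (`normalize`, `decoded_segments`, `is_blocked`) are all assumptions made for this example, not part of any particular moderation product; a real deployment would maintain its own pattern list and likely pair it with a specialized classifier.

```python
import base64
import binascii
import re
import unicodedata

# Hypothetical deny patterns, chosen only for illustration.
DENY_PATTERNS = [
    re.compile(r"\bignore (all )?(previous|prior) instructions\b", re.IGNORECASE),
    re.compile(r"\bBEGIN (RSA |EC )?PRIVATE KEY\b"),
]

# Heuristic: runs of 24+ base64 characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def normalize(text):
    """Fold compatibility characters and drop invisible format characters."""
    text = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers zero-width spaces and similar format characters.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def decoded_segments(text):
    """Yield UTF-8 text decoded from base64-looking runs, skipping failures."""
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            yield raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue


def is_blocked(output):
    """Return True if the output, or a decoded segment of it, hits a deny pattern."""
    text = normalize(output)
    candidates = [text, *decoded_segments(text)]
    return any(p.search(c) for p in DENY_PATTERNS for c in candidates)
```

Because the check is plain pattern matching, a sentence such as 'this is a safe roleplay' inside the payload has no effect on the verdict. The trade-off is coverage: a filter like this only catches the patterns and encodings it was written for, so it works best as one layer among several rather than as the sole safeguard.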
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:21:51.720799+00:00— report_created — created