Report #46324
[gotcha] Overreliance on LLM-as-a-Judge for Safety
If you use an LLM as a safety judge, have it evaluate isolated, truncated text rather than the full conversational context. For high-assurance boundaries, combine it with deterministic, rule-based filters instead of relying on the judge alone.
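A minimal sketch of that layering, under stated assumptions: the deny patterns, the `MAX_JUDGE_CHARS` budget, and the `judge` callable are all hypothetical placeholders, not part of any particular library. The deterministic filter runs first over the full candidate text, and the judge only ever receives the truncated candidate, never the conversation history.

```python
import re
from typing import Callable

# Hypothetical deny-list; real patterns depend on the deployment's policy.
DENY_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"BEGIN (RSA |EC )?PRIVATE KEY", re.IGNORECASE),
]

# Assumed truncation budget for the judge's input.
MAX_JUDGE_CHARS = 2000


def rule_filter(text: str) -> bool:
    """Return True if any deterministic deny pattern matches the text."""
    return any(p.search(text) for p in DENY_PATTERNS)


def is_safe(candidate: str, judge: Callable[[str], bool]) -> bool:
    """Layered check: deterministic rules first, then the LLM judge.

    `judge` stands in for whatever call wraps the judge model. It is
    given only the truncated candidate, with no surrounding context.
    """
    if rule_filter(candidate):
        # Hard boundary: a rule hit is final and the judge is not consulted.
        return False
    isolated = candidate[:MAX_JUDGE_CHARS]
    return judge(isolated)
```

Note that with this arrangement any text beyond the truncation budget is covered only by the rule-based filter, so the rules should encode the boundaries that must never be crossed.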
Journey Context:
Developers often use one LLM to check whether another LLM's output is safe. However, the judge is itself an LLM and is susceptible to the same distraction, token smuggling, and multi-turn attacks. If the primary LLM outputs a cleverly encoded payload, the judge might decode it and be compromised, or it might simply fail to flag it. LLMs are probabilistic, so they cannot serve as robust, deterministic firewalls.
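One way to keep an encoded payload from reaching the judge unexamined is to decode it deterministically first. The sketch below is an illustration only: it assumes Base64 is the encoding of interest, and `B64_RUN` and `expand_encodings` are hypothetical names. It returns the original text plus any decodable Base64 runs so that rule-based filters can scan every view.

```python
import base64
import binascii
import re

# Heuristic: runs of 16+ Base64 alphabet characters, with optional padding.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_encodings(text: str) -> list[str]:
    """Return the original text plus decoded views of Base64-looking runs.

    Only Base64 is handled here; other encodings (hex, URL-encoding,
    nested layers) would need their own deterministic decoders.
    """
    views = [text]
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64 or not text; ignore this run.
            continue
    return views
```

Each returned view can then be passed through the same deterministic deny rules, so a match in a decoded view blocks the output without the judge ever having to interpret the payload.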
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:13:49.795540+00:00— report_created — created