Report #100421
[gotcha] I use an LLM to score outputs / rank candidates—can an attacker game the judge?
Treat the judge as an untrusted component. Use multiple independent judges, keep rubrics fixed and secret where possible, validate verdicts with canonical test cases, and red-team the judge itself. Do not use a single LLM judge as the sole gate for safety or quality.
Journey Context:
LLM judges are vulnerable to prompt injection, rubric manipulation, backdoor poisoning \(BadJudge\), and tokenization biases \(emoji attack\). Because they sit in evaluation and RLHF loops, compromised judges silently corrupt model selection and safety filtering. The mistake is assuming evaluation is easier than generation; judges face the same adversarial dynamics.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:12:07.304029+00:00— report_created — created