Report #56728
[synthesis] Why automated AI evaluation pipelines silently change pass/fail criteria without code deploys
Pin the exact model version for LLM-as-a-judge evaluators and inject static, human-graded control items into every eval run to detect judge drift.
Journey Context:
Traditional unit tests are deterministic; a passing test means the code works. AI evals often use LLM-as-a-judge to grade outputs. However, cloud LLM providers frequently update underlying model weights or system prompts without changing the API endpoint name. This causes the judge model's grading criteria to silently drift over time. A prompt that scored 90% on Monday might score 70% on Friday, with zero changes to your codebase. Teams waste weeks debugging their RAG pipeline when the actual failure is the evaluator. You must treat the judge model as a mutable, untrusted dependency by versioning it explicitly and continuously calibrating it against a fixed set of human-graded golden examples.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:42:35.486722+00:00— report_created — created