Report #25044
[research] Flaky agent evals due to relying on LLM-as-a-judge for deterministically verifiable CLI outputs
Map your evals to the verifiability spectrum. Use exact-match or regex assertions for CLI/API tool outputs and exit codes. Reserve LLM-as-a-judge strictly for unstructured text generation or browser/DOM outcomes where no programmatic ground truth exists.
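One way to make the mapping explicit is a small routing function that picks the evaluator from the kind of output being checked. This is a minimal sketch, not a prescribed implementation: the category names (`exit_code`, `cli_stdout`, `api_response`, `free_text`, `dom_outcome`), the `Evaluator` enum, and `pick_evaluator` are all hypothetical and would need to be adapted to your own eval harness.

```python
from enum import Enum


class Evaluator(Enum):
    EXACT_MATCH = "exact_match"
    REGEX = "regex"
    LLM_JUDGE = "llm_judge"


# Hypothetical output categories; rename to match your own eval harness.
DETERMINISTIC_KINDS = {"exit_code", "cli_stdout", "api_response"}
JUDGE_ONLY_KINDS = {"free_text", "dom_outcome"}


def pick_evaluator(output_kind: str, has_stable_format: bool = True) -> Evaluator:
    """Route an eval target to the most deterministic evaluator that can verify it."""
    if output_kind in DETERMINISTIC_KINDS:
        # Exact match when the output is byte-stable, regex when only its shape is.
        return Evaluator.EXACT_MATCH if has_stable_format else Evaluator.REGEX
    if output_kind in JUDGE_ONLY_KINDS:
        # No programmatic ground truth exists, so fall back to a judge model.
        return Evaluator.LLM_JUDGE
    # Fail loudly so new output kinds are classified on purpose, not by default.
    raise ValueError(f"unclassified output kind: {output_kind!r}")
```

Raising on unknown kinds is a deliberate choice in this sketch: it stops new eval targets from silently defaulting to the LLM judge.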
Journey Context:
Developers often default to LLM-as-a-judge for everything because it is easy to set up, but doing so introduces non-determinism into the evaluator itself. When an agent runs a CLI command, the exit code and stdout can be checked programmatically, with the same verdict on every run. Routing those checks through an LLM judge instead leads to flaky CI pipelines, where evaluator variance hides true regressions.
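As an illustration, the exit code and stdout of a command can be asserted on directly with the Python standard library, with no judge model involved. The helper name `check_cli`, its parameters, and the demo command are assumptions made for this sketch; the demo uses the current Python interpreter as a stand-in for whatever command the agent actually ran.

```python
import re
import subprocess
import sys


def check_cli(cmd, expected_exit=0, stdout_pattern=None, timeout=30):
    """Run a command and check exit code and stdout; return a list of failures."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    failures = []
    if result.returncode != expected_exit:
        failures.append(f"exit code {result.returncode}, expected {expected_exit}")
    if stdout_pattern is not None and not re.search(stdout_pattern, result.stdout):
        failures.append(f"stdout did not match {stdout_pattern!r}")
    return failures


# Stand-in for an agent-issued command: the interpreter printing a fixed line.
demo_cmd = [sys.executable, "-c", "print('created 3 files')"]
failures = check_cli(demo_cmd, expected_exit=0, stdout_pattern=r"^created \d+ files$")
```

An empty `failures` list means the check passed. Because the verdict depends only on the command's behavior, a failing run points at the agent or the tool rather than at evaluator variance.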
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:26:39.531361+00:00— report_created — created