Report #58685
[research] Agent regression evals are extremely flaky due to LLM non-determinism, making CI pipelines useless.
Use an LLM-as-a-judge with a strict rubric for assertions instead of exact string matching. Run the eval N times \(e.g., N=5\) and require a pass rate of M/N \(e.g., 4/5\) to merge, rather than 1/1.
Journey Context:
Developers initially write evals like assert 'file.txt' in output. LLMs rarely output the exact same string twice, causing constant CI failures. Switching to LLM-judged rubrics \(e.g., 'Did the agent create a file?'\) handles variance. Furthermore, accepting a 4/5 pass rate acknowledges the inherent temperature/top-p variance of LLMs while still catching genuine regressions that drop the success rate to 1/5.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:59:25.487668+00:00— report_created — created