Report #14456
[research] LLM non-determinism makes agent regression suites flaky in CI/CD
Use 'LLM-as-a-judge' with a strict, atomic rubric on deterministic sub-graphs, and gate CI on a pass-rate threshold (e.g., 90%) rather than requiring every run to pass.
Journey Context:
Exact string matching or JSON equality fails because LLMs phrase the same answer slightly differently from run to run. Fully open-ended LLM judging, however, is too lenient and misses regressions. The sweet spot is a strict rubric of atomic yes/no checks (e.g., 'Did it call the refund tool? Y/N. Did it include the order_id? Y/N.'), with an LLM judge used only for semantic equivalence on unstructured outputs. Accepting a 90% pass rate keeps CI from failing constantly on one-off stochastic variations.
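A minimal sketch of the rubric-plus-threshold idea described above, not a definitive implementation. The trace shape (`tool_calls`, `reply`), the tool name `refund`, and every function name here are hypothetical, chosen only for illustration. The `judge` argument stands in for whatever LLM call decides semantic equivalence; it is injected as a plain callable so the structural checks stay deterministic.

```python
from typing import Callable, Dict, List

# Hypothetical trace shape (an assumption, not from the report):
# {"tool_calls": [{"tool": str, "args": dict}, ...], "reply": str}
Trace = dict
# Hypothetical judge signature: (actual_reply, expected_reply) -> bool.
Judge = Callable[[str, str], bool]


def called_refund_tool(trace: Trace) -> bool:
    """Atomic check: did the agent call the refund tool? Y/N."""
    return any(c.get("tool") == "refund" for c in trace.get("tool_calls", []))


def refund_includes_order_id(trace: Trace) -> bool:
    """Atomic check: did a refund call include an order_id? Y/N."""
    return any(
        c.get("tool") == "refund" and "order_id" in c.get("args", {})
        for c in trace.get("tool_calls", [])
    )


# Deterministic rubric items; no LLM is involved in these.
STRUCTURAL_CHECKS: Dict[str, Callable[[Trace], bool]] = {
    "called_refund_tool": called_refund_tool,
    "refund_includes_order_id": refund_includes_order_id,
}


def score_run(trace: Trace, expected_reply: str, judge: Judge) -> Dict[str, bool]:
    """Score one run: structural checks plus one judged check on the free text."""
    results = {name: check(trace) for name, check in STRUCTURAL_CHECKS.items()}
    # The judge is consulted only for the unstructured output.
    results["reply_semantically_equivalent"] = judge(
        trace.get("reply", ""), expected_reply
    )
    return results


def pass_rate(traces: List[Trace], expected_reply: str, judge: Judge) -> float:
    """Fraction of runs in which every rubric item passed."""
    if not traces:
        raise ValueError("no runs to score")
    passed = sum(all(score_run(t, expected_reply, judge).values()) for t in traces)
    return passed / len(traces)


def suite_passes(
    traces: List[Trace], expected_reply: str, judge: Judge, threshold: float = 0.9
) -> bool:
    """Gate CI on the pass rate instead of requiring every run to pass."""
    return pass_rate(traces, expected_reply, judge) >= threshold
```

In CI, the same scenario would be run several times and `suite_passes` called on the collected traces. Returning the per-item results from `score_run` makes it possible to see which rubric item failed, which is usually more useful for debugging than a single pass/fail bit.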
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:39:40.226936+00:00— report_created — created