Report #8614
[research] Boolean pass/fail tests are useless for non-deterministic LLM agent regression
Build regression suites around statistical pass rates (e.g., 4/5 runs must pass) and rubric-based scoring rather than exact-match assertions. Use LLM-as-a-judge to evaluate intermediate reasoning steps.
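A minimal sketch of a pass-rate gate, assuming the agent check can be wrapped in a zero-argument callable that returns True or False. The names `pass_rate_gate` and `scripted_check` are hypothetical and used only for illustration. The 4-of-5 default mirrors the example above and is not a recommended value.

```python
from typing import Callable, Iterable


def pass_rate_gate(
    run_once: Callable[[], bool],
    runs: int = 5,
    min_passes: int = 4,
) -> tuple[bool, int]:
    """Run a non-deterministic check `runs` times.

    Returns (gate_passed, number_of_passing_runs). The gate passes when
    at least `min_passes` of the runs succeed.
    """
    if not 0 < min_passes <= runs:
        raise ValueError("require 0 < min_passes <= runs")
    passes = sum(1 for _ in range(runs) if run_once())
    return passes >= min_passes, passes


def scripted_check(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical stand-in for a real agent run: replays fixed outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


if __name__ == "__main__":
    # One failure out of five is tolerated by the 4/5 threshold.
    ok, passes = pass_rate_gate(scripted_check([True, True, False, True, True]))
    print(ok, passes)
```

In a real suite, `run_once` would invoke the agent and apply whatever per-run check is appropriate. The gate itself only counts outcomes.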
Journey Context:
LLM outputs vary from run to run. A test that passes 1/1 times today might fail 1 run in 5 tomorrow because of model weight updates or a non-zero sampling temperature. Treating agent evals like unit tests (exact-match assertions) therefore produces constant flaky failures. Define a tolerance threshold and evaluate semantic equivalence instead.
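One way to express a tolerance threshold is a weighted rubric whose criteria are checked by a pluggable judge. The sketch below is an assumption-laden illustration: `Criterion`, `rubric_score`, and `keyword_judge` are hypothetical names, and `keyword_judge` is a toy substring check standing in for a real LLM-as-a-judge call, which would need its own prompt and client.

```python
from dataclasses import dataclass
from typing import Callable

# A judge decides whether an output satisfies one criterion.
# In practice this would wrap an LLM call; here it is left pluggable.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float


def rubric_score(output: str, rubric: list[Criterion], judge: Judge) -> float:
    """Return the weighted fraction of rubric criteria the output satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = sum(c.weight for c in rubric if judge(c.description, output))
    return earned / total


def keyword_judge(criterion: str, output: str) -> bool:
    """Toy stand-in for an LLM judge: treats the criterion as a required phrase."""
    return criterion.lower() in output.lower()


if __name__ == "__main__":
    rubric = [
        Criterion("refund", 2.0),
        Criterion("14 days", 1.0),
        Criterion("apology", 1.0),
    ]
    answer = "You can request a refund within 14 days of purchase."
    score = rubric_score(answer, rubric, keyword_judge)
    threshold = 0.7  # illustrative tolerance, not a recommended value
    print(score, score >= threshold)
```

Because the score is a fraction rather than a boolean, differently worded answers can pass as long as they cover enough of the weighted criteria.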
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:05:18.492310+00:00— report_created — created