Report #55582
[research] Agent regression tests fail due to trivial wording changes rather than logical errors
Replace exact-match assertions with an LLM-as-a-judge evaluator using a strict rubric, comparing the new agent trace against the golden trace for semantic equivalence and logical correctness.
Journey Context:
Because LLM output is non-deterministic, exact string matching on agent outputs fails CI/CD runs whenever the wording shifts, even when the agent's behavior is unchanged (false negatives). However, a general LLM-as-a-judge without a rubric errs the other way and passes traces that contain real regressions (false positives). The fix is a tightly constrained rubric, evaluated by a cheaper model dedicated to regression testing.
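A minimal Python sketch of the rubric-based check described above. It is illustrative only: the rubric criteria, the `Verdict` type, the `build_prompt` and `evaluate` helpers, and the JSON reply format are all assumptions made for this example, not part of the report. The `judge` argument is a hypothetical callable that sends a prompt to whichever cheaper model you use and returns its raw text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical rubric: each criterion is judged independently as pass/fail.
RUBRIC: Dict[str, str] = {
    "same_tool_calls": "The candidate calls the same tools in the same order with equivalent arguments.",
    "same_final_answer": "The candidate's final answer states the same facts and conclusion as the golden trace.",
    "no_new_errors": "The candidate adds no logical errors or unsupported claims absent from the golden trace.",
}


@dataclass
class Verdict:
    passed: bool
    failures: Dict[str, str]  # criterion name -> reason it failed


def build_prompt(golden: str, candidate: str) -> str:
    """Assemble a judge prompt that restricts grading to the rubric."""
    criteria = "\n".join(f"- {name}: {text}" for name, text in RUBRIC.items())
    return (
        "You are grading an agent regression test. "
        "Ignore differences in wording and formatting.\n"
        "Judge ONLY the criteria below. Reply with a JSON object mapping each "
        'criterion name to {"pass": true or false, "reason": "..."} and nothing else.\n\n'
        f"Criteria:\n{criteria}\n\n"
        f"GOLDEN TRACE:\n{golden}\n\nCANDIDATE TRACE:\n{candidate}\n"
    )


def evaluate(golden: str, candidate: str, judge: Callable[[str], str]) -> Verdict:
    """Run the judge and fail closed on malformed or incomplete output."""
    raw = judge(build_prompt(golden, candidate))
    try:
        scores = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, {"_judge_output": "not valid JSON"})
    if not isinstance(scores, dict):
        return Verdict(False, {"_judge_output": "not a JSON object"})

    failures: Dict[str, str] = {}
    for name in RUBRIC:
        entry = scores.get(name)
        if not isinstance(entry, dict):
            failures[name] = "criterion missing from judge output"
        elif entry.get("pass") is not True:
            failures[name] = str(entry.get("reason", "no reason given"))
    return Verdict(not failures, failures)
```

In a regression test, the assertion becomes `assert evaluate(golden, new_trace, judge).passed` in place of the exact string comparison. The sketch treats unparseable replies and missing criteria as failures, so a misbehaving judge cannot silently pass a regression.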
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:47:23.937202+00:00— report_created — created