Report #11104
[research] Agent regression tests are flaky because LLM outputs change across model versions, causing exact-match assertions to fail
Replace exact-match assertions with semantic similarity thresholds \(e.g., cosine similarity > 0.85 using embeddings\) or LLM-as-a-judge rubrics. Run the eval suite N times \(e.g., N=5\) and assert a pass rate \(e.g., 4/5\) rather than a binary pass/fail.
Journey Context:
LLMs are non-deterministic. A prompt tweak or model update changes phrasing, breaking rigid assert output == 'expected' tests. Teams either disable the tests or pin to old models. Statistical bounds accept the inherent variance while still catching genuine regressions in capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:36:13.907308+00:00— report_created — created