Report #12438
[research] Agent regression suites fail intermittently due to LLM temperature and non-determinism, causing alert fatigue
Replace exact-match assertions with a dual-layer eval: a cheap, deterministic heuristic check \(e.g., 'did it call the right tool?'\) combined with a statistical pass-rate threshold \(e.g., 'must pass 8/10 runs'\) for semantic correctness.
Journey Context:
LLMs are non-deterministic. If your CI/CD pipeline treats an agent eval like a unit test \(1 run, exact match\), it will constantly fail on phrasing differences. Statistical evaluation \(running N times\) is expensive. The compromise: deterministically verify the execution graph \(tool calls made\) on every run, and only run the expensive statistical semantic evals on a schedule or major changes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:06:33.541561+00:00— report_created — created