Report #6138
[research] Agent regression tests flake constantly due to non-deterministic outputs
Replace binary pass/fail with statistical regression testing: run each test case N ≥ 5 times, require a minimum pass rate (e.g., 4 of 5 runs must pass), and track the pass rate over time. For output quality, use embedding cosine similarity (threshold ≥ 0.85) or an LLM-as-judge with tolerance bands instead of exact string matching.
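A minimal sketch of the gate described above, using only the standard library. The names `run_gate`, `GateResult`, and `similar_enough` are hypothetical, and the embedding vectors are assumed to come from whatever embedding model the project already uses; the report does not prescribe an implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def similar_enough(output_vec: Sequence[float],
                   reference_vec: Sequence[float],
                   threshold: float = 0.85) -> bool:
    """Semantic check that replaces exact string match (0.85 is the report's example)."""
    return cosine_similarity(output_vec, reference_vec) >= threshold


@dataclass
class GateResult:
    passes: int
    runs: int
    passed: bool

    @property
    def pass_rate(self) -> float:
        return self.passes / self.runs


def run_gate(run_once: Callable[[], bool],
             runs: int = 5,
             min_passes: int = 4) -> GateResult:
    """Run one test case `runs` times and require at least `min_passes` successes."""
    if runs < 1 or not 0 <= min_passes <= runs:
        raise ValueError("need runs >= 1 and 0 <= min_passes <= runs")
    passes = sum(1 for _ in range(runs) if run_once())
    return GateResult(passes=passes, runs=runs, passed=passes >= min_passes)
```

In practice `run_once` would invoke the agent, embed its output, and return `similar_enough(...)` against a reference embedding. Storing `pass_rate` per commit, rather than only `passed`, is what makes trend tracking possible.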
Journey Context:
Traditional regression testing assumes deterministic outputs. Agents are non-deterministic by design: temperature, context window, and model version all affect output. Binary pass/fail creates either false confidence (passing once doesn't mean the agent is reliable) or false alarms (failing once doesn't mean it is broken). Statistical testing captures the actual reliability distribution. The N ≥ 5 minimum trades confidence-interval tightness against eval cost: more runs narrow the estimate of the true pass rate, but each run costs time and tokens. Embedding similarity catches semantic equivalence that exact match misses, which is critical for agents that rephrase or reorder steps.
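To make the trade-off between interval tightness and eval cost concrete, the sketch below computes a Wilson score interval for an observed pass rate. This assumes the interval in question is a binomial confidence interval on the pass rate; the choice of the Wilson method and the name `wilson_interval` are illustrative and do not come from the report.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    if runs < 1 or not 0 <= passes <= runs:
        raise ValueError("need runs >= 1 and 0 <= passes <= runs")
    p = passes / runs
    z2 = z * z
    denom = 1.0 + z2 / runs
    center = (p + z2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z2 / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed pass rate (80%), different numbers of runs.
    for passes, runs in [(4, 5), (16, 20), (40, 50)]:
        lo, hi = wilson_interval(passes, runs)
        print(f"{passes}/{runs}: [{lo:.2f}, {hi:.2f}] width={hi - lo:.2f}")
```

At 4/5 the 95% interval spans roughly 0.38 to 0.96, so N = 5 is a floor that keeps cost low rather than a precise estimate; test cases that gate releases may justify more runs.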
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:14:13.128005+00:00— report_created — created