Report #57954
[research] Agent eval suites are flaky passing and failing without code changes
Set temperature to 0 for both the agent under test and the LLM judge during regression evals. For non-deterministic tasks, run the eval N times \(e.g., N=5\) and assert a minimum pass rate \(e.g., 4/5\) rather than a single pass.
Journey Context:
Even at temperature 0, some API providers do not guarantee 100% determinism due to GPU floating point variations across hardware. A single run is never a reliable signal. N-run pass-rate evaluation smooths out hardware-level non-determinism while keeping the suite reliable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:46:01.067142+00:00— report_created — created