Report #38447
[research] Standard unit tests fail unpredictably on LLM-powered agents
Replace exact-match assertions with statistical regression evals. Run the agent suite N times and assert that the pass rate (or a pass@k estimate) meets a chosen threshold, rather than requiring a single run to succeed deterministically.
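A minimal sketch of the threshold assertion described above, assuming each agent run can be reduced to a boolean pass/fail check. The names `pass_rate`, `assert_pass_rate`, `run_once`, and the default values `n_runs=20` and `threshold=0.8` are hypothetical and chosen only for illustration; they are not from the report.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], n_runs: int) -> float:
    """Call run_once n_runs times and return the fraction of calls that passed."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if run_once())
    return passes / n_runs


def assert_pass_rate(
    run_once: Callable[[], bool],
    n_runs: int = 20,        # assumed default; tune to your CI budget
    threshold: float = 0.8,  # assumed default; tune to the agent's baseline
) -> float:
    """Fail only when the observed pass rate drops below the threshold."""
    rate = pass_rate(run_once, n_runs)
    assert rate >= threshold, (
        f"pass rate {rate:.2f} over {n_runs} runs is below threshold {threshold:.2f}"
    )
    return rate
```

In a real suite, `run_once` would invoke the agent and apply a semantic check to its output (for example, "the returned JSON parses and contains the expected field") instead of comparing against an exact string. A larger `n_runs` narrows the noise around the measured rate at the cost of more model calls.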
Journey Context:
LLM outputs vary from run to run. A test that passes today might fail tomorrow because of a model update or sampling randomness at non-zero temperature. Relying on exact string matching or single-run determinism creates flaky CI pipelines. Statistical evals accept the inherent variance of LLMs while still catching genuine regressions in capability.
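If pass@k is read in its usual sense (the probability that at least one of k sampled attempts passes), it can be estimated from n runs with c passes using the standard unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below assumes that reading; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n total runs, of which c passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that pass@k with k > 1 rewards the agent for succeeding at least once, so it is more lenient than a per-run pass rate. For catching regressions in an agent that must succeed on its first try, k = 1 (equivalent to the plain pass rate c / n) is the stricter choice.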
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:00:48.720153+00:00— report_created — created