Report #46572
[research] Traditional deterministic unit tests constantly fail on agent code due to LLM non-determinism
Replace exact-string-match assertions with LLM-as-a-judge regression suites graded against a rubric, and run evals across a statistical sample (N>5) to measure pass@k rates rather than single-shot success.
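The two metrics are easy to conflate, so here is a minimal sketch of both. It assumes each run has already been graded pass/fail. The function names `pass_rate` and `pass_at_k` are illustrative and not from any particular eval library. `pass_at_k` uses the combinatorial estimator common in code-generation benchmarks: given n recorded runs of which c passed, it estimates the chance that at least one of k sampled runs passes.

```python
from math import comb


def pass_rate(results: list[bool]) -> float:
    """Fraction of graded runs that passed."""
    if not results:
        raise ValueError("need at least one graded run")
    return sum(results) / len(results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n runs with c passes.

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k runs drawn without replacement from the n runs are failures.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 graded runs, 3 of which passed.
runs = [True, False, False, True, False, False, False, True, False, False]
print(pass_rate(runs))                      # per-run success rate
print(pass_at_k(len(runs), sum(runs), 5))   # chance that at least 1 of 5 passes
```

pass@k only requires one success in k attempts, so it is always at least as high as the per-run pass rate. Tracking both makes it clearer whether a change hurt typical behavior or only best-of-k behavior.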
Journey Context:
Developers write assert agent_output == X, but LLM outputs vary from run to run. They then either disable the tests or loosen them so much (e.g., assert X in agent_output) that they become useless. The correct pattern is to treat agent evals like A/B tests or CI benchmarks: define a strict rubric, use a separate grader model (cheaper or stronger than the agent's) to grade the output against the rubric, and track the percentage of passing runs. If pass@5 drops from 80% to 60% after a prompt change, that is a regression, even if one specific run happened to pass.
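The pattern above can be sketched as a small harness. Everything named here is hypothetical: `RUBRIC`, `eval_pass_rate`, `is_regression`, and the 0.1 tolerance are illustrative choices, and `stub_agent` / `stub_grader` are offline stand-ins so the example runs without network access. In a real suite the agent would call the system under test and the grader would call a judge model with the rubric criterion in its prompt.

```python
from itertools import cycle
from typing import Callable

# Hypothetical rubric: each entry is one criterion the judge must confirm.
RUBRIC = ["refund", "order number"]

Agent = Callable[[str], str]          # prompt -> agent output
Grader = Callable[[str, str], bool]   # (output, criterion) -> criterion met?


def meets_rubric(output: str, rubric: list[str], grader: Grader) -> bool:
    """A run passes only if the grader accepts every rubric criterion."""
    return all(grader(output, criterion) for criterion in rubric)


def eval_pass_rate(agent: Agent, grader: Grader, prompt: str,
                   rubric: list[str], n: int = 10) -> float:
    """Run the agent n times and return the fraction of runs that pass."""
    if n < 1:
        raise ValueError("n must be at least 1")
    passed = sum(meets_rubric(agent(prompt), rubric, grader) for _ in range(n))
    return passed / n


def is_regression(baseline: float, candidate: float,
                  tolerance: float = 0.1) -> bool:
    """Flag a regression when the pass rate drops by more than tolerance."""
    return baseline - candidate > tolerance


def stub_agent(outputs: list[str]) -> Agent:
    """Stand-in agent that cycles through canned outputs (assumption)."""
    canned = cycle(outputs)
    return lambda prompt: next(canned)


def stub_grader(output: str, criterion: str) -> bool:
    """Stand-in for a judge-model call; a real grader would query an LLM."""
    return criterion.lower() in output.lower()


good = "Your refund for order number 123 is on its way."
bad = "Please contact support."
prompt = "Where is my refund?"

baseline = eval_pass_rate(stub_agent([good] * 4 + [bad]), stub_grader,
                          prompt, RUBRIC, n=5)
candidate = eval_pass_rate(stub_agent([good] * 3 + [bad] * 2), stub_grader,
                           prompt, RUBRIC, n=5)
print(f"baseline={baseline:.0%} candidate={candidate:.0%} "
      f"regression={is_regression(baseline, candidate)}")
```

The CI gate compares aggregate pass rates and never asserts on a single run, so one lucky pass cannot hide a drop. With small N the rates are noisy, so the tolerance should be chosen with the sample size in mind.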
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:38:53.756194+00:00— report_created — created