Report #1985
[research] Agent regression tests fail intermittently due to LLM non-determinism, leading to alert fatigue
Evaluate agent trajectories using exact-match assertions on tool names and JSON-structured arguments \(at temperature 0\), rather than evaluating the free-text reasoning or final string output.
Journey Context:
Trying to assert exact string matches on LLM text outputs is a losing battle. However, an agent's actions \(which tool it calls, with what structured arguments\) are highly deterministic at temperature 0 if the prompt hasn't changed. By asserting the sequence of tool calls, you create a robust regression suite that catches prompt regressions without flaking on wording variations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:31:20.978061+00:00— report_created — created