Report #1633
[research] Agent regression eval suite is flaky because LLM outputs are non-deterministic, failing CI randomly
Use LLM-as-a-judge with strict rubrics for regression suites, but anchor the judge with exact expected tool-call sequences or state transitions rather than just evaluating free-text output.
Journey Context:
Exact-match and regex assertions fail on LLM outputs because sampling temperature and model updates change the wording from run to run. Pure LLM-as-a-judge has the opposite problem: it tends to be too lenient and lets regressions through. The solution is a hybrid approach: assert the sequence of tool calls or API interactions deterministically (this works when the test environment itself is deterministic), and use LLM-as-a-judge only for the final natural-language synthesis. This removes most of the flakiness while still catching functional regressions where the agent chooses the wrong tool path.
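A minimal sketch of the hybrid check described above, assuming the agent harness can record its tool calls as structured data. Every name here (`ToolCall`, `AgentRun`, `evaluate`, `keyword_judge`) is hypothetical and not taken from any particular framework, and `keyword_judge` is only a stand-in for a real rubric-driven LLM judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical structure)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value


def call(name: str, **args) -> ToolCall:
    """Build a ToolCall with normalized argument order."""
    return ToolCall(name, tuple(sorted(args.items())))


@dataclass
class AgentRun:
    """What the harness is assumed to capture from one agent run."""
    tool_calls: List[ToolCall]
    final_answer: str


def check_tool_sequence(run: AgentRun, expected: List[ToolCall]) -> List[str]:
    """Deterministic part: compare the recorded calls to the expected path."""
    errors = []
    if len(run.tool_calls) != len(expected):
        errors.append(f"expected {len(expected)} calls, got {len(run.tool_calls)}")
    for i, (got, want) in enumerate(zip(run.tool_calls, expected)):
        if got != want:
            errors.append(f"step {i}: expected {want}, got {got}")
    return errors


def keyword_judge(answer: str, rubric: List[str]) -> bool:
    """Stub judge: passes if every rubric phrase appears in the answer.
    A real suite would call an LLM with a strict rubric here instead."""
    return all(phrase.lower() in answer.lower() for phrase in rubric)


def evaluate(
    run: AgentRun,
    expected: List[ToolCall],
    rubric: List[str],
    judge: Callable[[str, List[str]], bool],
) -> Tuple[bool, List[str]]:
    """Fail fast on a wrong tool path; consult the judge only for the prose."""
    errors = check_tool_sequence(run, expected)
    if errors:
        return False, errors  # judge is never called on a wrong tool path
    if judge(run.final_answer, rubric):
        return True, []
    return False, ["judge rejected final answer"]
```

Because the tool-path check runs first and short-circuits, a wrong tool choice fails the test the same way on every run, and the non-deterministic judge is only consulted on the final text.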
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T05:31:35.800720+00:00 - report_created - created