Report #44579
[research] Agent regression suites are flaky because LLM outputs are non-deterministic, causing spurious failures in CI
Use semantic equivalence or embedding-based similarity checks for agent outputs in CI, combined with a 'golden trace' structural comparison for tool calls, rather than exact string matching.
Journey Context:
Exact matching on an agent's final answer or tool call arguments fails whenever temperature > 0 or the phrasing differs slightly between runs. Pure LLM-as-a-judge, however, is too slow and expensive to run on every CI regression pass. The hybrid approach is fast: check the structure of the trace (did it call the right tools in the right order?) using JSON schemas, and use embedding distance for the final free-text output.
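A minimal sketch of the hybrid check described above, under stated assumptions. The trace shape (a list of {"tool": ..., "args": {...}} dicts), the function names, and the 0.85 threshold are all hypothetical and not taken from any particular agent framework. The structural check here compares tool names, order, and argument keys directly; a JSON Schema validator could take its place for stricter argument checks. toy_embed is a hashed bag-of-words stand-in so the sketch runs without dependencies; a real suite would pass in its own embedding function and tune the threshold on known-good runs.

```python
import math
import zlib
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def trace_matches(golden: list[dict], actual: list[dict]) -> bool:
    """Structural check: same tools, in the same order, with the same argument names."""
    if len(golden) != len(actual):
        return False
    for expected, observed in zip(golden, actual):
        if expected["tool"] != observed["tool"]:
            return False
        # Compare argument keys only; argument values may vary between runs.
        if set(expected["args"]) != set(observed["args"]):
            return False
    return True


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def output_matches(golden_text: str, actual_text: str,
                   embed: Embedder, threshold: float = 0.85) -> bool:
    """Semantic check: embeddings of the two texts must be close enough.

    The threshold is an assumed starting point, not a recommended value.
    """
    return cosine_similarity(embed(golden_text), embed(actual_text)) >= threshold


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedder (hashed bag of words). Replace with a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def regression_passes(golden: dict, actual: dict, embed: Embedder) -> bool:
    """Run the cheap structural check first, then the embedding comparison."""
    return (trace_matches(golden["trace"], actual["trace"])
            and output_matches(golden["answer"], actual["answer"], embed))


if __name__ == "__main__":
    golden = {
        "trace": [{"tool": "search", "args": {"query": "capital of France"}}],
        "answer": "The capital of France is Paris.",
    }
    actual = {
        "trace": [{"tool": "search", "args": {"query": "France capital city"}}],
        "answer": "Paris is the capital of France.",
    }
    print(regression_passes(golden, actual, toy_embed))
```

Running the structural check first keeps the common failure case (wrong tool or wrong order) cheap, since no embedding call is made when the trace already disagrees with the golden trace.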
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:17:36.833267+00:00— report_created — created