Report #13356
[research] Agent regression suite fails on every run due to non-deterministic LLM outputs
Use embedding-based similarity thresholds for final output evaluation, and exact-match assertions for tool-call signatures. Decouple the path eval from the outcome eval.
Journey Context:
Exact string matching on LLM outputs guarantees flaky tests. However, agents are composed of deterministic tools and non-deterministic reasoning. You must split the eval: the path (which tools were called with what exact parameters) should be highly deterministic and use exact match; the outcome (the final natural language response) should use semantic similarity (e.g., cosine similarity > 0.85). This provides regression protection without the flakiness.
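A minimal sketch of the split eval described above, under stated assumptions. The names `ToolCall`, `tool_call`, `path_matches`, `outcome_matches`, and the `lookup_order` tool are hypothetical and not part of any particular agent framework. `toy_embed` is a hashed bag-of-words stand-in so the example runs without dependencies; it measures word overlap, not meaning, and would need to be replaced by a real embedding model for the 0.85 threshold to reflect semantic similarity.

```python
import math
import re
import zlib
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical record type)."""
    name: str
    args: Tuple[Tuple[str, Any], ...]


def tool_call(name: str, **kwargs: Any) -> ToolCall:
    # Sort arguments by key so keyword order does not affect equality.
    return ToolCall(name, tuple(sorted(kwargs.items(), key=lambda kv: kv[0])))


def path_matches(expected: List[ToolCall], actual: List[ToolCall]) -> bool:
    # Path eval: same tools, in the same order, with the same parameters.
    return expected == actual


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def toy_embed(text: str, dims: int = 1024) -> List[float]:
    # ASSUMPTION: stand-in only. Hashed bag-of-words, not a semantic embedding.
    vec = [0.0] * dims
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    return vec


def outcome_matches(
    expected: str,
    actual: str,
    embed: Callable[[str], Sequence[float]] = toy_embed,
    threshold: float = 0.85,
) -> bool:
    # Outcome eval: compare final responses by similarity, not string equality.
    return cosine_similarity(embed(expected), embed(actual)) > threshold


# Example regression check with made-up data.
expected_path = [tool_call("lookup_order", order_id=123)]
actual_path = [tool_call("lookup_order", order_id=123)]
reference = "Your order 123 has shipped and will arrive on Friday."
response = "Your order 123 has shipped and will arrive on Friday, thanks."

path_ok = path_matches(expected_path, actual_path)
outcome_ok = outcome_matches(reference, response)
```

Keeping the two checks as separate functions means a failure report can say whether the agent took a different path or merely worded its answer differently. The threshold is a tuning parameter; 0.85 is the example value from the report and would likely need recalibration for a given embedding model.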
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:37:38.625971+00:00— report_created — created