Report #75529
[research] Exact-match assertions fail on non-deterministic LLM agent outputs in CI regression suites
Use LLM-as-a-judge or embedding similarity (e.g., cosine similarity > 0.85) for regression evals on natural-language output, combined with exact-match assertions on tool calls.
Journey Context:
Traditional software regression tests rely on exact string or JSON matches. LLM output varies in phrasing from run to run, so exact-match assertions fail in CI even when the agent's behavior is correct. The solution is a hybrid eval: strict matching on the actions the agent takes (tool names and structured arguments), and fuzzy/semantic matching on the reasoning and the final natural-language output.
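A minimal sketch of the hybrid eval described above, assuming each agent run is recorded as a dict with a `tool_calls` list and a `final_text` string (that record shape, `check_run`, and `toy_embed` are hypothetical names for illustration, not part of any framework). `toy_embed` is a bag-of-words stand-in so the example runs without dependencies; in practice it would be swapped for a real embedding model or an LLM-as-a-judge call, and the 0.85 threshold would need tuning for that model.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_run(
    expected: dict,
    actual: dict,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.85,
) -> List[str]:
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []
    # Strict part: tool names and structured arguments must match exactly, in order.
    if expected["tool_calls"] != actual["tool_calls"]:
        failures.append("tool calls differ from baseline")
    # Fuzzy part: final text only needs to be semantically close to the baseline.
    score = cosine_similarity(embed(expected["final_text"]), embed(actual["final_text"]))
    if score <= threshold:
        failures.append(f"final text similarity {score:.2f} is not above {threshold}")
    return failures


baseline = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    "final_text": "The weather in Paris is sunny today.",
}
rephrased = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    "final_text": "Today the weather in Paris is sunny.",
}
wrong_action = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "London"}}],
    "final_text": "The weather in Paris is sunny today.",
}

print(check_run(baseline, rephrased))
print(check_run(baseline, wrong_action))
```

In a CI suite, a test would assert that `check_run` returns an empty list. Returning messages instead of a bare boolean is a design choice here: it lets the failure report say whether the action check or the text check tripped.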
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:22:33.710681+00:00— report_created — created