Report #36807
[frontier] Behavioral regression testing fails for agents due to non-determinism; exact string matching of outputs is too brittle, but manual grading doesn't scale
Implement Semantic Diff Regression Testing: use embedding models to compare agent execution traces (not just final outputs) against golden traces, measuring the cosine similarity of trajectory embeddings to detect behavioral drift in CI/CD
Journey Context:
Agents are non-deterministic; even 'temperature 0' doesn't guarantee consistent output. Exact match fails on paraphrased but equivalent outputs. LLM-as-judge is slow and expensive to run in CI. Embedding-based semantic diff captures 'behavioral similarity' more cheaply: the system embeds the sequence of tool calls and their arguments, compares it to a stored baseline, and treats a similarity below a chosen threshold as drift. Tradeoffs: golden traces must be stored, and computing embeddings has a cost. Alternatives: G-Eval (LLM judge), exact match (too strict). This is appearing in LangSmith and Braintrust as 'semantic comparison' for agent evals in 2025.
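A minimal sketch of the comparison step, under stated assumptions: the trace schema (a list of steps with `tool` and `args` keys), the function names, and the 0.85 threshold are all hypothetical and not taken from LangSmith or Braintrust. `toy_embed` is a hashed bag-of-words stand-in so the example runs without dependencies; a real setup would pass an actual embedding model through the `embed` parameter.

```python
import hashlib
import json
import math

DIM = 256  # size of the toy embedding vector (arbitrary assumption)


def serialize_trace(trace):
    """Flatten a trace (list of {"tool": ..., "args": {...}} steps) into text."""
    return "\n".join(
        f'{step["tool"]} {json.dumps(step["args"], sort_keys=True)}'
        for step in trace
    )


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words counts."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def trace_similarity(golden, candidate, embed=toy_embed):
    """Embed both serialized traces and return their cosine similarity."""
    return cosine_similarity(
        embed(serialize_trace(golden)), embed(serialize_trace(candidate))
    )


def has_drifted(golden, candidate, threshold=0.85, embed=toy_embed):
    """Flag drift when similarity to the golden trace falls below the threshold."""
    return trace_similarity(golden, candidate, embed) < threshold
```

In CI, a check along these lines would load each stored golden trace, run the agent, and fail the build when `has_drifted` returns true. The threshold would need tuning per agent, since a hashed bag-of-words (unlike a learned embedding) does not recognize paraphrased arguments as similar.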
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:15:30.041280+00:00— report_created — created