Report #45398
[frontier] Agent evaluation is inconsistent and cannot catch regressions in complex workflows
Adopt LLM-as-Judge with structured rubric outputs and pairwise comparison in LangSmith, running evaluations in shadow mode against production traces
Journey Context:
Exact-match unit tests are unreliable for agents because agent behavior is non-deterministic. The emerging pattern is LLM-as-Judge with Pydantic-structured outputs for consistent rubric scoring, combined with shadow-mode evaluation, where candidate agent versions process production traffic (without returning results to users) so that regressions are caught before deployment.
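A minimal sketch of the pattern described above, under several assumptions. It uses a standard-library dataclass in place of a Pydantic model so that it runs without dependencies, and it does not call the LangSmith API. The rubric criteria, the 1-5 score range, the trace shape ({"input": ...}), and every function name are hypothetical. stub_judge stands in for a real LLM call. The comparison step here only compares the two rubric totals per trace; a judge that sees both answers side by side would be a different design.

```python
import json
from dataclasses import dataclass

# Hypothetical rubric: the report does not name criteria or a score range.
CRITERIA = ("correctness", "completeness")


@dataclass(frozen=True)
class RubricScore:
    """Structured judge verdict: one integer score (1-5) per criterion."""
    correctness: int
    completeness: int
    rationale: str

    def __post_init__(self) -> None:
        # Reject off-schema scores so a malformed judge reply fails loudly.
        for name in CRITERIA:
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer in 1..5, got {value!r}")

    @property
    def total(self) -> int:
        return self.correctness + self.completeness


def parse_judge_output(raw: str) -> RubricScore:
    """Parse the judge's JSON reply into a validated RubricScore."""
    data = json.loads(raw)
    return RubricScore(
        correctness=data["correctness"],
        completeness=data["completeness"],
        rationale=str(data.get("rationale", "")),
    )


def stub_judge(prompt: str, answer: str) -> str:
    """Stand-in for an LLM judge: scores by word count, capped at 5."""
    score = min(5, max(1, len(answer.split())))
    return json.dumps({"correctness": score, "completeness": score, "rationale": "stub"})


def shadow_evaluate(traces, baseline, candidate, judge):
    """Replay recorded inputs through both agents and compare judged scores.

    The candidate's output is only scored here; it is never returned to a user.
    """
    tally = {"candidate": 0, "baseline": 0, "tie": 0}
    for trace in traces:
        prompt = trace["input"]
        base = parse_judge_output(judge(prompt, baseline(prompt)))
        cand = parse_judge_output(judge(prompt, candidate(prompt)))
        if cand.total > base.total:
            tally["candidate"] += 1
        elif cand.total < base.total:
            tally["baseline"] += 1
        else:
            tally["tie"] += 1
    return tally
```

In use, traces would come from recorded production inputs, and a rising "baseline" count for a new candidate version would be the regression signal to check before deployment.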
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:40:31.504068+00:00— report_created — created