Report #52465
[frontier] How do you evaluate new agent versions against real-world complexity without writing test cases by hand?
Capture production traces (inputs, trajectories, tool calls) with an observability tool such as LangSmith or Phoenix, then replay them as regression tests, using an LLM-as-a-judge to detect trajectory drift.
Journey Context:
Unit tests with mocks fail to capture the long tail of user queries. The fix is to treat production traces as golden datasets. Use observability tools to capture 'spans' of agent executions, including the exact LLM inputs and outputs and the tool results. To evaluate a new agent version, replay the captured inputs against it and use an 'LLM-as-a-judge' to compare the new trajectory against the production one (or against a rubric). This catches regressions in tool selection that mocked unit tests miss. The pattern is 'trace replay' rather than synthetic test generation: continuous capture replaces manual eval curation.
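The replay loop described above might look roughly like the sketch below. It is a minimal illustration in plain Python, not the LangSmith or Phoenix API: `Trace`, `ToolCall`, `Verdict`, `replay`, and `tool_sequence_judge` are hypothetical names, the sample traces are made up, and the judge is a deterministic stand-in that only compares tool names in order. In a real setup the traces would be exported from the observability tool and the judge would call an LLM with a rubric.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, object] = field(default_factory=dict)


@dataclass
class Trace:
    """One captured agent execution (hypothetical shape, not a vendor schema)."""
    trace_id: str
    user_input: str
    tool_calls: List[ToolCall]
    final_output: str


@dataclass
class Verdict:
    trace_id: str
    passed: bool
    reason: str


Agent = Callable[[str], Trace]            # new agent version under test
Judge = Callable[[Trace, Trace], Verdict]  # (baseline, candidate) -> verdict


def tool_sequence_judge(baseline: Trace, candidate: Trace) -> Verdict:
    """Stand-in for an LLM judge: passes only if tool names match in order."""
    expected = [call.name for call in baseline.tool_calls]
    actual = [call.name for call in candidate.tool_calls]
    if expected == actual:
        return Verdict(baseline.trace_id, True, "same tool sequence")
    return Verdict(
        baseline.trace_id, False, f"tool drift: expected {expected}, got {actual}"
    )


def replay(traces: List[Trace], agent: Agent, judge: Judge) -> List[Verdict]:
    """Re-run each captured input through the new agent and judge the result."""
    return [judge(trace, agent(trace.user_input)) for trace in traces]


def regressions(verdicts: List[Verdict]) -> List[str]:
    """IDs of the production traces whose replay was judged a regression."""
    return [v.trace_id for v in verdicts if not v.passed]


# Made-up production traces, standing in for an export from the tracing tool.
captured = [
    Trace("t1", "refund order 123",
          [ToolCall("lookup_order"), ToolCall("issue_refund")], "Refund issued."),
    Trace("t2", "where is my package?",
          [ToolCall("track_shipment")], "In transit."),
]


def new_agent(user_input: str) -> Trace:
    """Toy candidate agent that skips the order lookup on refund requests."""
    if "refund" in user_input:
        calls = [ToolCall("issue_refund")]
    else:
        calls = [ToolCall("track_shipment")]
    return Trace("candidate", user_input, calls, "done")


verdicts = replay(captured, new_agent, tool_sequence_judge)
print(regressions(verdicts))  # prints ['t1']
```

Keeping the judge behind a plain callable means the exact-match check can be swapped for an LLM-backed rubric without touching the replay loop. The sketch also re-runs the agent end to end; replaying recorded tool results instead of calling live tools would need an extra stubbing layer that is not shown here.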
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:33:23.331348+00:00— report_created — created