Report #30553
[frontier] Unit tests pass but production agents fail on complex multi-step reasoning chains, and debugging requires manual log diving
Adopt trajectory-based evaluation: trace every LLM call, tool execution, and agent decision into a dataset; run regression tests against golden trajectories and use LLM-as-judge to score step-by-step correctness.
Journey Context:
Traditional software testing (assert output == expected) fails for agents because there are many valid paths to a correct answer, and 'correctness' often requires understanding intent across multiple steps. 'Agentic evaluation' (popularized by LangSmith, Braintrust, Phoenix) treats the execution trace (who called whom with what arguments, in what order) as the artifact to test. Teams curate 'golden trajectories' (expert demonstrations) and score new runs against them with metrics such as 'trajectory edit distance' or with 'LLM-as-judge' prompts ('Did the agent follow the correct sequence of verification steps?'). This catches regressions where a refactor changes the agent's planning strategy even though the final output still looks correct. Tradeoff: it requires infrastructure to trace and store runs, and LLM judges add cost, but it is essential for maintaining more than 10 agent variants in production.
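One way to make 'trajectory edit distance' concrete is a minimal Python sketch like the one below. It is an illustration under stated assumptions, not the implementation used by any of the tools named above: the `Step` record, `trajectory_edit_distance`, `passes_regression`, and the `max_distance` threshold are all hypothetical names, and a trace is reduced to an ordered list of tool calls. The distance is a plain Levenshtein distance over steps, and a run is accepted if it is close to any golden trajectory, since several paths can be valid.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    """One traced agent action (hypothetical schema): tool name plus hashable args."""
    tool: str
    args: tuple = ()


def trajectory_edit_distance(actual: Sequence[Step], golden: Sequence[Step]) -> int:
    """Levenshtein distance over steps; insert, delete, and substitute each cost 1."""
    prev = list(range(len(golden) + 1))  # distance from empty prefix of `actual`
    for i, a in enumerate(actual, start=1):
        curr = [i]  # distance from actual[:i] to empty golden prefix
        for j, g in enumerate(golden, start=1):
            cost = 0 if a == g else 1
            curr.append(min(prev[j] + 1,          # delete a step from actual
                            curr[j - 1] + 1,      # insert a missing golden step
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def passes_regression(actual: Sequence[Step],
                      golden_set: Sequence[Sequence[Step]],
                      max_distance: int = 1) -> bool:
    """Accept the run if it is within max_distance of ANY golden trajectory."""
    return any(trajectory_edit_distance(actual, g) <= max_distance for g in golden_set)


if __name__ == "__main__":
    golden = [Step("search"), Step("verify"), Step("answer")]
    skipped_verify = [Step("search"), Step("answer")]
    print(trajectory_edit_distance(skipped_verify, golden))
    print(passes_regression(skipped_verify, [golden], max_distance=0))
```

A strict threshold (`max_distance=0`) flags any deviation from the golden set, while a looser one tolerates small detours. Runs that fall outside the threshold are natural candidates for the more expensive LLM-as-judge pass.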
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:40:08.017290+00:00— report_created — created