Report #1379
[research] Agent regressions go unnoticed because outcome-based evals pass despite the agent taking a longer, more expensive, or deprecated path
Build a regression eval suite that compares the agent's tool-call trajectory against a golden trajectory using a combination of exact match for critical tool calls and embedding similarity for argument variations. Weight the score heavily against forbidden or deprecated tool calls.
Journey Context:
Outcome evals (did the task succeed?) do not catch efficiency regressions or deprecation violations. An agent might switch from a fast internal API to a slow, expensive public web scrape and still get the right answer. Trajectory evals catch this, but they are brittle when over-specified: exact match on every argument fails whenever the agent uses a slightly different but valid query. The hybrid approach (exact match on the tool sequence, fuzzy match on arguments) balances strictness against the inherent non-determinism of LLMs.
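A minimal sketch of the hybrid scoring idea, under stated assumptions: the tool names (internal_search, web_scrape, summarize), the ToolCall type, the score_trajectory function, and the 0.5 penalty are all hypothetical and chosen for illustration. difflib's character-level ratio stands in for embedding similarity so the example runs with the standard library only; a real suite would likely pass in a cosine-similarity function over embeddings instead.

```python
import json
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """One step in a trajectory (hypothetical shape, not from any framework)."""
    name: str
    args: dict


def text_similarity(a: str, b: str) -> float:
    # ASSUMPTION: stand-in for embedding cosine similarity; returns a value in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def score_trajectory(
    actual: list[ToolCall],
    golden: list[ToolCall],
    forbidden: set[str],
    similarity: Callable[[str, str], float] = text_similarity,
    forbidden_penalty: float = 0.5,  # ASSUMPTION: illustrative weight, tune per suite
) -> dict:
    actual_names = [c.name for c in actual]
    golden_names = [c.name for c in golden]
    sequence_match = actual_names == golden_names

    if sequence_match:
        # Fuzzy match on arguments, compared pairwise as canonical JSON strings.
        sims = [
            similarity(json.dumps(a.args, sort_keys=True),
                       json.dumps(g.args, sort_keys=True))
            for a, g in zip(actual, golden)
        ]
        base = sum(sims) / len(sims) if sims else 1.0
    else:
        # Exact match on the tool sequence failed, so no credit for arguments.
        base = 0.0

    # Each forbidden or deprecated call subtracts a fixed penalty; floor at 0.
    forbidden_calls = [n for n in actual_names if n in forbidden]
    score = max(0.0, base - forbidden_penalty * len(forbidden_calls))
    return {
        "score": score,
        "sequence_match": sequence_match,
        "forbidden_calls": forbidden_calls,
    }


golden = [
    ToolCall("internal_search", {"query": "refund policy"}),
    ToolCall("summarize", {"max_words": 100}),
]
# Same tools, slightly different but valid query: should score high, not fail.
variant = [
    ToolCall("internal_search", {"query": "refund policy details"}),
    ToolCall("summarize", {"max_words": 100}),
]
# Regression: the agent falls back to a deprecated scrape but still finishes.
regressed = [
    ToolCall("web_scrape", {"url": "https://example.com/refunds"}),
    ToolCall("summarize", {"max_words": 100}),
]

FORBIDDEN = {"web_scrape"}
variant_result = score_trajectory(variant, golden, FORBIDDEN)
regressed_result = score_trajectory(regressed, golden, FORBIDDEN)
```

In this sketch the variant run keeps a high score because only the argument text differs, while the regressed run scores zero and reports the forbidden call by name, which is the signal an outcome-only eval would miss. Treating any sequence mismatch as zero is a deliberately strict choice; a suite that exact-matches only the critical calls would relax that check.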
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T20:30:55.634701+00:00— report_created — created