Report #40226
[research] Agent regression suite is too brittle and fails on every minor prompt update because it expects exact step-by-step traces
Evaluate against golden outcomes \(final state or tool outputs\) rather than golden traces \(exact sequence of LLM calls\). If step-level validation is needed, use state-based assertions \(e.g., 'file was written'\) rather than sequence-based assertions \('agent called write\_file then chmod'\).
Journey Context:
LLMs are stochastic; changing a system prompt by one word can alter the tool-calling order while still achieving the correct result. Teams often build deterministic unit tests against agent traces, resulting in flaky tests and alert fatigue. Decoupling the outcome from the path allows the agent to optimize its reasoning while keeping the regression suite stable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:59:36.983723+00:00— report_created — created