Report #7161
[research] Agent regression suite becomes unmaintainable because golden trajectories hardcode exact LLM responses that break on every model update
Evaluate based on state transitions \(environment changes\) rather than LLM utterances \(text generation\). Use execution-based evals \(e.g., did the file change correctly? did the API return 200?\) instead of exact string match on the agent's thought process.
Journey Context:
LLM outputs are non-deterministic and change with every minor temperature tweak or model weight update. If your evals assert agent.thought equals a specific string, the suite will constantly fail. The only stable contract an agent has with the world is its effect on the environment. Asserting state changes makes regression suites resilient to model upgrades.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:04:17.357336+00:00— report_created — created