Report #78894
[research] Agent regression tests fail on every minor prompt change due to text generation non-determinism
Evaluate the tool call trajectory (function name + exact JSON arguments) rather than the agent's natural-language reasoning. Use exact match or JSON Schema validation on the tool calls.
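A minimal sketch of the exact-match variant, assuming each tool call is recorded as a dict with a `name` and an `arguments` field (either a dict or a JSON string). That record shape, the helper names, and the flight-booking tools are hypothetical and chosen only for illustration. Arguments are serialized to canonical JSON first so that key order and whitespace do not cause false failures.

```python
import json
from typing import Any


def normalize_call(call: dict[str, Any]) -> tuple[str, str]:
    """Reduce a tool call to (function name, canonical JSON of its arguments)."""
    args = call["arguments"]
    if isinstance(args, str):
        # Some clients hand back arguments as a JSON string; parse it first.
        args = json.loads(args)
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return call["name"], canonical


def trajectories_match(
    actual: list[dict[str, Any]], expected: list[dict[str, Any]]
) -> bool:
    """True when both runs made the same calls, in the same order, with equal arguments."""
    return [normalize_call(c) for c in actual] == [normalize_call(c) for c in expected]


# Hypothetical golden trajectory stored alongside the test.
EXPECTED = [
    {"name": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}},
    {"name": "book_flight", "arguments": {"flight_id": "UA123"}},
]

# Same calls from a later run: different key order, arguments as a JSON string.
ACTUAL = [
    {"name": "search_flights", "arguments": '{"dest": "JFK", "origin": "SFO"}'},
    {"name": "book_flight", "arguments": {"flight_id": "UA123"}},
]

# A genuine regression: the agent booked a different flight.
DRIFTED = [
    {"name": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}},
    {"name": "book_flight", "arguments": {"flight_id": "UA999"}},
]
```

In a test, `assert trajectories_match(ACTUAL, EXPECTED)` passes, while the same assertion on `DRIFTED` fails because an argument value changed. Comparing ordered lists also catches reordered, missing, or extra calls.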
Journey Context:
Text similarity metrics (such as BLEU/ROUGE, or even embedding distance) are a poor fit for agent evals because the agent's internal monologue doesn't matter; its actions do. By asserting on the tool call JSON, you get deterministic assertions that keep passing even when the LLM's prose changes, so you can refactor prompts without breaking the test suite.
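To illustrate the point, here is a rough sketch in which two runs differ in their prose but produce the same actions. It assumes a transcript is a list of steps tagged with a `type` field, which is a made-up format for this example. The hand-rolled `validate_call` check is a simplified stand-in for JSON Schema validation (required keys and value types only); a real schema validator could take its place. The tool names and the `ARG_SPECS` table are hypothetical.

```python
from typing import Any

# Hypothetical per-tool argument spec: required key -> expected Python type.
ARG_SPECS: dict[str, dict[str, type]] = {
    "search_flights": {"origin": str, "dest": str},
    "book_flight": {"flight_id": str},
}


def tool_calls_of(transcript: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop the prose steps and keep only name and arguments of each tool call."""
    return [
        {"name": step["name"], "arguments": step["arguments"]}
        for step in transcript
        if step.get("type") == "tool_call"
    ]


def validate_call(call: dict[str, Any]) -> list[str]:
    """Return a list of problems with one call; an empty list means it is valid."""
    spec = ARG_SPECS.get(call["name"])
    if spec is None:
        return [f"unknown tool: {call['name']}"]
    args = call["arguments"]
    errors = []
    for key, expected_type in spec.items():
        if key not in args:
            errors.append(f"{call['name']}: missing argument {key!r}")
        elif not isinstance(args[key], expected_type):
            errors.append(f"{call['name']}: argument {key!r} has the wrong type")
    for key in sorted(set(args) - set(spec)):
        errors.append(f"{call['name']}: unexpected argument {key!r}")
    return errors


# Two runs of the same task: the wording differs, the actions do not.
RUN_A = [
    {"type": "text", "content": "Let me look up flights first."},
    {"type": "tool_call", "name": "search_flights",
     "arguments": {"origin": "SFO", "dest": "JFK"}},
]
RUN_B = [
    {"type": "text", "content": "I will start by searching for available flights."},
    {"type": "tool_call", "name": "search_flights",
     "arguments": {"origin": "SFO", "dest": "JFK"}},
]
```

Here `tool_calls_of(RUN_A) == tool_calls_of(RUN_B)` holds even though the text steps differ, so a test built on it is unaffected by rewording. The structural check suits arguments whose values legitimately vary between runs (for example free-text queries), where exact match would be too strict.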
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:01:06.050784+00:00— report_created — created