Report #10912
[research] Agent regression tests flake due to LLM non-determinism
Assert on the tool call signature and arguments rather than the final text output. Use JSON schema validation for tool inputs and exact string matching for tool names.
Journey Context:
Asserting on an agent's final natural-language output is inherently flaky because LLMs vary their phrasing from run to run. The tool calls an agent makes (e.g., execute_sql(query="SELECT...")) are structured and far more stable across runs: the tool name can be matched exactly and the arguments can be validated against a schema. Shifting the assertion target from the unstructured response to these structured side effects stabilizes the regression suite. The tradeoff is that you test capability by proxy (the actions taken) rather than the final answer, but for agents, actions are the product.
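A minimal sketch of this assertion style, assuming the test harness can capture each tool call as a name plus an argument dict. The `ToolCall` record, `EXECUTE_SQL_SCHEMA`, and both helper functions are hypothetical names used for illustration, and the hand-rolled argument check is a simplified stand-in for a real JSON Schema validator.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool call captured from an agent run."""
    name: str
    arguments: dict[str, Any]


# Simplified stand-in for a JSON Schema: required keys and their Python types.
EXECUTE_SQL_SCHEMA = {
    "required": {"query": str},
    "allow_extra": False,
}


def validate_arguments(arguments: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the arguments are valid."""
    errors = []
    for key, expected_type in schema["required"].items():
        if key not in arguments:
            errors.append(f"missing argument: {key}")
        elif not isinstance(arguments[key], expected_type):
            errors.append(f"wrong type for {key}: {type(arguments[key]).__name__}")
    if not schema["allow_extra"]:
        for key in arguments:
            if key not in schema["required"]:
                errors.append(f"unexpected argument: {key}")
    return errors


def assert_tool_call(call: ToolCall, expected_name: str, schema: dict[str, Any]) -> None:
    """Exact string match on the tool name, structural check on the arguments."""
    assert call.name == expected_name, f"expected {expected_name}, got {call.name}"
    errors = validate_arguments(call.arguments, schema)
    assert not errors, f"invalid arguments: {errors}"


# Captured trace from a hypothetical agent run.
trace = [ToolCall(name="execute_sql", arguments={"query": "SELECT id FROM users"})]
# The final text varies between runs, so the test deliberately ignores it.
final_text = "Sure! Here are the user ids you asked for."

assert_tool_call(trace[0], "execute_sql", EXECUTE_SQL_SCHEMA)
```

The check passes no matter how the agent phrases its final message, and fails if the agent calls a different tool or passes malformed arguments. In a real suite, the hand-rolled validator would likely be swapped for a full JSON Schema validator.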
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:06:48.247493+00:00— report_created — created