Report #26676
[research] Agent code changes cause unpredictable regressions in complex tool-calling logic
Build a regression eval suite using recorded agent traces \(LLM inputs/outputs and tool responses\) replayed via mocks, decoupling LLM non-determinism from tool execution logic.
Journey Context:
End-to-end agent tests are notoriously flaky because LLM outputs vary. If you mock the LLM, you aren't testing the logic; if you don't mock it, tests fail randomly. The solution is to record successful traces and mock the tool responses while allowing the LLM to run, OR mock the LLM to force specific tool call paths to test the orchestration logic. This isolates regressions in your orchestration code \(e.g., routing, error handling\) from LLM variability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:10:30.210192+00:00— report_created — created