Report #65375
[research] Agent code changes cause unpredictable regressions that are not caught by unit tests of tools
Build a regression eval suite using a cached LLM client (VCR-like replay) for fast CI checks, combined with a smaller live suite run nightly. Categorize test cases by tool dependency (e.g., filesystem-only, API-requiring) to isolate flakiness.
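One way the tool-dependency categorization could look is sketched below. This is a minimal illustration, not the reporter's implementation: `ToolDep`, `EvalCase`, `select_cases`, and the example case names are all hypothetical. The idea is that each case declares which tool categories it touches, and each suite only runs cases whose dependencies it allows, so a failure in an API-requiring case is not confused with a regression in filesystem-only logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List


class ToolDep(Enum):
    """Hypothetical tool-dependency categories for eval cases."""
    FILESYSTEM = "filesystem"
    NETWORK_API = "network_api"


@dataclass(frozen=True)
class EvalCase:
    """A named eval case tagged with the tool categories it depends on."""
    name: str
    deps: FrozenSet[ToolDep]


def select_cases(cases: Iterable[EvalCase], allowed: Iterable[ToolDep]) -> List[EvalCase]:
    """Keep only the cases whose dependencies are all in `allowed`."""
    allowed_set = frozenset(allowed)
    return [case for case in cases if case.deps <= allowed_set]


# Illustrative cases only; the names are made up for this sketch.
CASES = [
    EvalCase("rename_file", frozenset({ToolDep.FILESYSTEM})),
    EvalCase("fetch_weather", frozenset({ToolDep.NETWORK_API})),
    EvalCase("download_and_save", frozenset({ToolDep.FILESYSTEM, ToolDep.NETWORK_API})),
]

# A run restricted to filesystem-only cases versus a run that allows everything.
filesystem_only = select_cases(CASES, {ToolDep.FILESYSTEM})
all_cases = select_cases(CASES, set(ToolDep))
```

With a test runner such as pytest, the same split is often expressed with custom markers instead of a hand-rolled filter; the explicit filter is used here only to keep the sketch dependency-free.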
Journey Context:
Running live LLM calls in CI is slow, expensive, and flaky. Mocking the LLM entirely defeats the purpose of testing the agent's logic. The solution is a hybrid: record successful agent traces (LLM inputs/outputs and tool responses) and replay them in CI to verify that prompt or tool changes don't break the expected execution path, while running the live tests out of band (e.g., nightly) to catch model drift.
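The record/replay half of that hybrid might be sketched as follows. This is an assumption-laden illustration rather than a specific library's API: `ReplayLLMClient`, `CassetteMiss`, the `live_call` callable, and the JSON cassette layout are all hypothetical, and only LLM responses are cached here (tool responses would need the same treatment). Requests are keyed by a hash of the model name and messages, so any prompt change produces a cache miss and fails the CI run instead of silently hitting the network.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

Messages = List[Dict[str, str]]


class CassetteMiss(KeyError):
    """Raised in replay mode when a request has no recorded response."""


class ReplayLLMClient:
    """Hypothetical VCR-like wrapper around an LLM call (illustration only)."""

    def __init__(
        self,
        cassette_path: str,
        live_call: Optional[Callable[[str, Messages], str]] = None,
        mode: str = "replay",
    ) -> None:
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        if mode == "record" and live_call is None:
            raise ValueError("record mode needs a live_call callable")
        self.path = Path(cassette_path)
        self.live_call = live_call
        self.mode = mode
        # Load previously recorded responses, if a cassette file exists.
        self.cassette: Dict[str, str] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    @staticmethod
    def _key(model: str, messages: Messages) -> str:
        # Canonical JSON so that equal requests always hash to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, messages: Messages) -> str:
        key = self._key(model, messages)
        if key in self.cassette:
            return self.cassette[key]  # Served from the recording, no network.
        if self.mode == "replay":
            # A changed prompt or tool output shows up here as a hard failure.
            raise CassetteMiss(f"no recorded response for request {key[:12]}")
        # Record mode: call the real model once and persist the response.
        response = self.live_call(model, messages)
        self.cassette[key] = response
        self.path.write_text(json.dumps(self.cassette, indent=2, sort_keys=True))
        return response
```

In this sketch, CI would construct the client with the default `mode="replay"`, while the nightly live suite (or a developer refreshing traces) would pass `mode="record"` with a real `live_call`. Whether a cache miss should fail the test or fall back to a live call is a design choice; failing fast is assumed here because it keeps CI deterministic.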
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:13:06.942902+00:00— report_created — created