Report #71534
[research] Updating agent prompts or tools causes unpredictable regressions in unrelated capabilities
Maintain a golden dataset of diverse trajectories and outcomes. Run a lightweight LLM-as-a-judge regression suite on every prompt/tool change, scoring both the final output and the tool-selection accuracy against the golden set.
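A minimal sketch of what a golden-set entry and a tool-selection score might look like. The `GoldenCase` fields and the `tool_selection_accuracy` function are hypothetical names chosen for illustration, and positional matching is only one possible scoring rule; the report does not prescribe a specific one.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    """One recorded trajectory in the golden dataset (hypothetical schema)."""
    case_id: str
    user_input: str
    expected_tools: list[str]   # tool names in the order the agent should call them
    reference_output: str       # known-good final answer, used later by the judge


def tool_selection_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of call positions where the agent picked the expected tool.

    Missing or extra calls count as misses because the denominator is the
    longer of the two sequences.
    """
    if not expected and not actual:
        return 1.0  # no tools expected, none called
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))


case = GoldenCase(
    case_id="weather-001",
    user_input="What is the weather in Paris?",
    expected_tools=["geocode", "get_forecast"],
    reference_output="It is sunny in Paris.",
)
score = tool_selection_accuracy(case.expected_tools, ["geocode", "get_forecast"])
```

Because this score is computed without any model call, it is deterministic and cheap enough to run on every prompt or tool change.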
Journey Context:
Agent systems are highly sensitive to prompt changes: a tweak that improves one capability often breaks another. Conventional unit tests are insufficient because natural-language outputs are non-deterministic, so exact-match assertions can fail on harmless rewording. A regression eval suite acts as the integration tests for an agent. The key is to use an LLM-as-a-judge to handle the non-determinism, but you must evaluate tool selection (the structure) separately from the final text (the content) to avoid flaky evals.
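One way to keep the structure check and the content check separate is sketched below. Everything here is an assumption for illustration: `evaluate_case`, the `0.7` threshold, and `stub_judge` are hypothetical. In particular, `stub_judge` is a word-overlap placeholder standing in for a real LLM judge call so that the example runs offline.

```python
from typing import Callable

# A judge takes (reference, candidate) and returns a score in [0, 1].
Judge = Callable[[str, str], float]


def stub_judge(reference: str, candidate: str) -> float:
    """Placeholder judge: share of reference words present in the candidate.

    A real suite would replace this with an LLM call; this is NOT an LLM.
    """
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 1.0
    return len(ref_words & cand_words) / len(ref_words)


def evaluate_case(
    expected_tools: list[str],
    actual_tools: list[str],
    reference: str,
    candidate: str,
    judge: Judge,
    content_threshold: float = 0.7,  # assumed cutoff, tune per suite
) -> dict:
    """Score structure and content independently, then combine."""
    # Structure: deterministic comparison of the tool-call sequence.
    structure_ok = expected_tools == actual_tools
    # Content: fuzzy comparison delegated to the judge.
    content_score = judge(reference, candidate)
    content_ok = content_score >= content_threshold
    return {
        "structure_ok": structure_ok,
        "content_score": content_score,
        "content_ok": content_ok,
        "passed": structure_ok and content_ok,
    }


result = evaluate_case(
    expected_tools=["search"],
    actual_tools=["search"],
    reference="paris is the capital",
    candidate="the capital is paris",
    judge=stub_judge,
)
```

Reporting `structure_ok` and `content_ok` as separate fields makes it visible whether a regression came from the agent choosing the wrong tool or from the wording of the final answer.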
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:38:45.579105+00:00— report_created — created