Report #91770
[research] Updating the agent system prompt fixes one edge case but causes regressions in core tasks
Maintain a frozen golden-trajectory regression suite. Before merging any prompt change, run the suite, diff each tool-call trajectory against its golden counterpart using a tree-edit-distance metric, and fail the CI pipeline if the edit distance exceeds a strict threshold.
Journey Context:
Text-diffing the agent's outputs is unreliable because LLM output is non-deterministic: two acceptable runs can differ in wording. Checking only final answers misses behavioral regressions (e.g., the agent switching from a fast API to a slow web scrape). Tree-edit distance over the tool-call sequence gives a deterministic, quantifiable measure of how far a recorded trajectory has drifted from its golden counterpart, which lets you set strict CI gates.
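A minimal sketch of the metric and the gate, under stated assumptions: each trajectory is encoded as an ordered tree whose node labels are tool names and whose children are nested sub-calls, every edit (insert, delete, relabel) costs 1, and the threshold value is arbitrary. The helper names (node, forest_distance, trajectory_distance, passes_gate) and the tool names are hypothetical, not taken from any particular framework. The recursion is the textbook ordered-forest edit distance with memoization, which is fine for small trajectories; a production suite would likely want a tuned implementation.

```python
from functools import lru_cache

# Hypothetical encoding: a node is (label, children), where label is a tool
# name and children is a tuple of nested sub-call nodes.
def node(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of nodes)."""
    if not f1 and not f2:
        return 0
    if not f1:
        # Insert the rightmost root of f2; its children take its place.
        _, kids = f2[-1]
        return forest_distance(f1, f2[:-1] + kids) + 1
    if not f2:
        # Delete the rightmost root of f1; its children take its place.
        _, kids = f1[-1]
        return forest_distance(f1[:-1] + kids, f2) + 1
    (l1, k1), (l2, k2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + k1, f2) + 1,  # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + k2) + 1,  # insert rightmost root of f2
        # Match the two rightmost roots (cost 1 if the tool names differ),
        # then solve their children and the remaining forests separately.
        forest_distance(k1, k2)
        + forest_distance(f1[:-1], f2[:-1])
        + (1 if l1 != l2 else 0),
    )

def trajectory_distance(golden, candidate):
    """Tree edit distance between two single-rooted trajectories."""
    return forest_distance((golden,), (candidate,))

def passes_gate(golden, candidate, threshold):
    """True when the candidate stays within the allowed behavioral drift."""
    return trajectory_distance(golden, candidate) <= threshold

# Golden run: the agent answers through a fast API call.
golden = node("run", node("search_api"), node("summarize"))
# Candidate run after a prompt change: the agent falls back to scraping.
candidate = node(
    "run",
    node("web_scrape", node("fetch_page"), node("parse_html")),
    node("summarize"),
)

print(trajectory_distance(golden, candidate))  # 3: one relabel, two inserts
print(passes_gate(golden, candidate, threshold=1))  # False
```

In a CI job, a failing passes_gate result would translate into a non-zero exit code. Because the metric is only deterministic for a given pair of recorded trajectories, the candidate run itself should be made as repeatable as the setup allows (for example, fixed sampling settings and stubbed tool responses) before the distance is trusted as a gate.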
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:37:40.062064+00:00— report_created — created